Semiautomated relay method and apparatus

ABSTRACT

A relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, a processor linked to the display and programmed to perform the steps of receiving the HU voice signal from the AU device, transmitting the HU voice signal to a remote automatic speech recognition (ASR) server running ASR software that converts the HU voice signal to ASR generated text, the remote ASR server located at a remote location from the relay, receiving the ASR generated text from the ASR server, presenting the ASR generated text for viewing by a call assistant (CA) via the display and transmitting the ASR generated text to the AU device.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/982,239 which was filed on May 17, 2018, and which is titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 15/729,069 which was filed on Oct. 10, 2017, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 15/171,720, filed on Jun. 2, 2017, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 14/953,631, filed on Nov. 30, 2015, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which is a continuation-in-part of U.S. patent application Ser. No. 14/632,257, filed on Feb. 26, 2015, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, which claims priority to U.S. provisional patent application Ser. No. 61/946,072 filed on Feb. 28, 2014, and titled “SEMIAUTOMATED RELAY METHOD AND APPARATUS”, and claims priority to each of the above applications, each of which is incorporated herein in its entirety by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.

BACKGROUND OF THE DISCLOSURE

The present invention relates to relay systems for providing voice-to-text captioning for hearing impaired users and more specifically to a relay system that uses automated voice-to-text captioning software to transcribe voice-to-text.

Many people have at least some degree of hearing loss. For instance, in the United States, about 3 out of every 1000 people are functionally deaf and about 17 percent (36 million) of American adults report some degree of hearing loss which typically gets worse as people age. Many people with hearing loss have developed ways to cope with the ways their loss affects their ability to communicate. For instance, many deaf people have learned to use their sight to compensate for hearing loss by either communicating via sign language or by reading another person's lips as they speak.

When it comes to remotely communicating using a telephone, unfortunately, there is no way for a hearing impaired person (e.g., an assisted user (AU)) to use sight to compensate for hearing loss as conventional telephones do not enable an AU to see a person on the other end of the line (e.g., no lip reading or sign viewing). For persons with only partial hearing impairment, some simply turn up the volume on their telephones to try to compensate for their loss and can make do in most cases. For others with more severe hearing loss conventional telephones cannot compensate for their loss and telephone communication is a poor option.

An industry has evolved for providing communication services to AUs whereby voice communications from a person linked to an AU's communication device are transcribed into text and displayed on an electronic display screen for the AU to read during a communication session. In many cases the AU's device will also broadcast the linked person's voice substantially simultaneously as the text is displayed so that an AU that has some ability to hear can use their hearing sense to discern most phrases and can refer to the text when some part of a communication is not understandable from what was heard.

U.S. Pat. No. 6,603,835 (hereinafter “the '835 patent”) titled “System For Text Assisted Telephony” teaches several different types of relay systems for providing text captioning services to AUs. One captioning service type is referred to as a single line system where a relay is linked between an AU's device and a telephone used by the person communicating with the AU. Hereinafter, unless indicated otherwise the other person communicating with the AU will be referred to as a hearing user (HU) even though the AU may in fact be communicating with another AU. In single line systems, one line links an HU device to the relay and one line (e.g., the single line) links the relay to the AU device. Voice from the HU is presented to a relay call assistant (CA) who transcribes the voice-to-text and then the text is transmitted to the AU device to be displayed. The HU's voice is also, in at least some cases, carried or passed through the relay to the AU device to be broadcast to the AU.

The other captioning service type described in the '835 patent is a two line system. In a two line system a HU's telephone is directly linked to an AU's device via a first line for voice communications between the AU and the HU. When captioning is required, the AU can select a captioning control button on the AU device to link to the relay and provide the HU's voice to the relay on a second line. Again, a relay CA listens to the HU voice message and transcribes the voice message into text which is transmitted back to the AU device on the second line to be displayed to the AU. One of the primary advantages of the two line system over one line systems is that the AU can add captioning to an on-going call. This is important as many AUs are only partially impaired and may only want captioning when absolutely necessary. The option to not have captioning is also important in cases where an AU device can be used as a normal telephone and where non-AUs (e.g., a spouse living with an AU that has good hearing capability) that do not need captioning may also use the AU device.

With any relay system, the primary factors for determining the value of the system are accuracy, speed and cost to provide the service. Regarding accuracy, text should accurately represent spoken messages from HUs so that an AU reading the text has an accurate understanding of the meaning of the message. Erroneous words provide inaccurate messages and also can cause confusion for an AU reading transcribed text.

Regarding speed, ideally text is presented to an AU simultaneously with the voice message corresponding to the text so that an AU sees text associated with a message as the message is heard. In this regard, text that trails a voice message by several seconds can cause confusion. Current systems present captioned text relatively quickly (e.g., 1-3 seconds after the voice message is broadcast) most of the time. However, at times a CA can fall behind when captioning so that longer delays (e.g., 10-15 seconds) occur.

Regarding cost, existing systems require a unique and highly trained CA for each communication session. In known cases CAs need to be able to speak clearly and need to be able to type quickly and accurately. CA jobs are also relatively high pressure jobs and therefore turnover is relatively high when compared to jobs in many other industries which further increases the costs associated with operating a relay.

One innovation that has increased captioning speed appreciably and that has reduced the costs associated with captioning at least somewhat has been the use of voice-to-text transcription software by relay CAs. In this regard, early relay systems required CAs to type all of the text presented via an AU device. To present text as quickly as possible after broadcast of an associated voice message, highly skilled typists were required. During normal conversations people routinely speak at a rate between 110 and 150 words per minute. During a conversation between an AU and an HU, typically only about half the words voiced have to be transcribed (e.g., the AU typically communicates to the HU during half of a session). Because of various inefficiencies this means that to keep up with transcribing the HU's portion of a typical conversation a CA has to be able to type at around 100 words per minute or more. To this end, most professional typists type at around 50 to 80 words per minute and therefore can keep up with a normal conversation for at least some time. Professional typists are relatively expensive. In addition, despite being able to keep up with a conversation most of the time, at other times (e.g., during long conversations or during particularly high speed conversations) even professional typists fall behind transcribing real time text and more substantial delays can occur.

In relay systems that use voice-to-text transcription software trained to a CA's voice, a CA listens to an HU's voice and revoices the HU's voice message to a computer running the trained software. The software, being trained to the CA's voice, transcribes the revoiced message much more quickly than a typist can type text and with only minimal errors. In many respects revoicing techniques for generating text are easier and much faster to learn than high speed typing and therefore training costs and the general costs associated with CAs are reduced appreciably. In addition, because revoicing is much faster than typing in most cases, voice-to-text transcription can be expedited appreciably using revoicing techniques.

At least some prior systems have contemplated further reducing costs associated with relay services by replacing CAs with computers running voice-to-text software to automatically convert HU voice messages to text. In the past there have been several problems with this solution which have resulted in no one implementing a workable system. First, most voice messages (e.g., an HU's voice message) delivered over most telephone lines to a relay are not suitable for direct voice-to-text transcription software. In this regard, automated transcription software on the market has been tuned to work well with a voice signal that includes a much larger spectrum of frequencies than the range used in typical phone communications. The frequency range of voice signals on phone lines is typically between 300 and 3000 Hz. Thus, automated transcription software does not work well with voice signals delivered over a telephone line and large numbers of errors occur. Accuracy further suffers where noise exists on a telephone line which is a common occurrence.

Second, many automated transcription software programs have to be trained to the voice of a speaker to be accurate. When a new HU calls an AU's device, there is no way for a relay to have previously trained software to the HU voice and therefore the software cannot accurately generate text using the HU voice messages.

Third, many automated transcription software packages use context in order to generate text from a voice message. To this end, the words around each word in a voice message can be used by software as context for determining which word has been uttered. To use words around a first word to identify the first word, the words around the first word have to be obtained. For this reason, many automated transcription systems wait to present transcribed text until after subsequent words in a voice message have been transcribed so that context can be used to correct prior words before presentation. Systems that hold off on presenting text to correct using subsequent context cause delay in text presentation which is inconsistent with the relay system need for real time or close to real time text delivery.

BRIEF SUMMARY OF THE DISCLOSURE

It has been recognized that a hybrid semi-automated system can be provided where, when acceptable accuracy can be achieved using automated transcription software, the system can automatically use the transcription software to transcribe HU voice messages to text and when accuracy is unacceptable, the system can patch in a human CA to transcribe voice messages to text. Here, it is believed that the number of CAs required at a large relay facility may be reduced appreciably (e.g., 30% or more) where software can accomplish a large portion of transcription to text. In this regard, not only is the automated transcription software getting better over time, in at least some cases the software may train to an HU's voice and the vagaries associated with voice messages received over a phone line (e.g., the limited 300 to 3000 Hz range) during a first portion of a call so that during a later portion of the call accuracy is particularly good. Training may occur while and in parallel with a CA manually (e.g., via typing, revoicing, etc.) transcribing voice-to-text and, once accuracy is at an acceptable threshold level, the system may automatically delink from the CA and use the text generated by the software to drive the AU display device.

It has been recognized that in a relay system there are at least two processors that may be capable of performing automated voice recognition processes and therefore that can handle the automated voice recognition part of a triage process involving a CA. To this end, in most cases either a relay processor or an AU's device processor may be able to perform the automated transcription portion of a hybrid process. For instance, in some cases an AU's device will perform automated transcription in parallel with a relay assistant generating CA generated text where the relay and AU's device cooperate to provide text and assess when the CA should be cut out of a call with the automated text replacing the CA generated text.

In other cases where a HU's communication device is a computer or includes a processor capable of transcribing voice messages to text, a HU's device may generate automated text in parallel with a CA generating text and the HU's device and the relay may cooperate to provide text and determine when the CA should be cut out of the call.

Regardless of which device is performing automated captioning, the CA generated text may be used to assess accuracy of the automated text for the purpose of determining when the CA should be cut out of the call. In addition, regardless of which device is performing automated text captioning, the CA generated text may be used to train the automated voice-to-text software or engine on the fly to expedite the process of increasing accuracy until the CA can be cut out of the call.
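
The accuracy assessment described above can be illustrated with a minimal sketch that treats the CA generated text as the reference and scores the automated text against it using a word-level edit distance. The function names, the scoring formula and the 0.95 default threshold are assumptions made for illustration only; the disclosure does not specify a particular metric or threshold value.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def asr_accuracy(ca_text, asr_text):
    """Score automated text against CA generated text (hypothetical metric)."""
    ref = ca_text.lower().split()
    hyp = asr_text.lower().split()
    if not ref:
        return 0.0
    return max(0.0, 1.0 - word_errors(ref, hyp) / len(ref))


def should_release_ca(ca_text, asr_text, threshold=0.95):
    """True when the automated text meets the assumed accuracy threshold."""
    return asr_accuracy(ca_text, asr_text) >= threshold
```

In such a sketch, a relay or AU device would call should_release_ca on a recent window of text and cut the CA out of the call only once the result stays true.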

It has also been recognized that there are times when a hearing impaired person is listening to a HU's voice without an AU's device providing simultaneous text when the AU is confused and would like transcription of recent voice messages of the HU. For instance, where an AU uses an AU's device to carry on a non-captioned call and the AU has difficulty understanding a voice message so that the AU initiates a captioning service to obtain text for subsequent voice messages. Here, while text is provided for subsequent messages, the AU still cannot obtain an understanding of the voice message that prompted initiation of captioning. As another instance, where CA generated text lags appreciably behind a current HU's voice message, an AU may request that the captioning catch up to the current message.

To provide captioning of recent voice messages in these cases, in at least some embodiments of this disclosure an AU's device stores an HU's voice messages and, when captioning is initiated or a catch up request is received, the recorded voice messages are used to either automatically generate text or to have a CA generate text corresponding to the recorded voice messages.
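
One way an AU's device might retain only the most recent portion of the HU's voice is a fixed-length buffer of audio frames, sketched below. The class name, the frame-based representation and the millisecond parameters are hypothetical and are not taken from the disclosure.

```python
from collections import deque


class RecentVoiceBuffer:
    """Keeps only the most recent HU audio frames (illustrative sketch)."""

    def __init__(self, max_ms, frame_ms=20):
        # Number of frames needed to cover max_ms of audio (assumed framing).
        self.max_frames = max(1, max_ms // frame_ms)
        self._frames = deque(maxlen=self.max_frames)

    def push(self, frame):
        """Store one frame; the oldest frame is dropped once the buffer is full."""
        self._frames.append(frame)

    def snapshot(self):
        """Return the buffered audio, oldest first, for captioning on request."""
        return b"".join(self._frames)
```

When captioning is initiated or a catch up request is received, the bytes returned by snapshot would be handed to an automated engine or sent to a relay, depending on the embodiment.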

In at least some cases when automated software is trained to a HU's voice, a voice model for the HU that can be used subsequently to tune automated software to transcribe the HU's voice may be stored along with a voice profile for the HU that can be used to distinguish the HU's voice from other HUs. Thereafter, when the HU calls an AU's device again, the profile can be used to identify the HU and the voice model can be used to tune the software so that the automated software can immediately start generating highly accurate or at least relatively more accurate text corresponding to the HU's voice messages.

At least some embodiments include a relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, a processor linked to the display and programmed to perform the steps of receiving the HU voice signal from the AU device, transmitting the HU voice signal to a remote automatic speech recognition (ASR) server running ASR software that converts the HU voice signal to ASR generated text, the remote ASR server located at a remote location from the relay, receiving the ASR generated text from the ASR server, presenting the ASR generated text for viewing by a call assistant (CA) via the display and transmitting the ASR generated text to the AU device.

In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display. In some cases the processor is further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases the relay separates the HU voice signal into voice signal slices, the step of transmitting the HU voice signal to the ASR server includes independently transmitting the voice signal slices to the remote ASR server for captioning and wherein the step of receiving the ASR generated text from the ASR server includes receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text.

In some cases at least some of the voice signal slices overlap. In some cases at least some of the voice signal slices are relatively short and some of the voice signal slices are relatively long and wherein the short voice signal slices are consecutive and do not overlap and wherein at least some relatively long voice signal slices overlap at least first and second of the relatively short voice signal slices. In some cases at least some of the ASR generated text associated with overlapping voice signal slices is inconsistent, the relay applying a rule set to identify which inconsistent ASR generated text to use in the stream of ASR generated text.
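
The slicing arrangement described above can be sketched as follows, with short consecutive slices, long slices that each span two short slices, and a simple rule for resolving inconsistent text. The function names and the rule itself (prefer the long-slice text when it is available, on the assumption that more context yields better text) are illustrative assumptions; the disclosure does not define the rule set at this point.

```python
def make_slices(total_ms, short_ms):
    """Return (short, long) lists of (start_ms, end_ms) slice boundaries."""
    # Short slices are consecutive and do not overlap one another.
    short = [(s, min(s + short_ms, total_ms))
             for s in range(0, total_ms, short_ms)]
    # Each long slice overlaps a first and a second short slice.
    long = [(short[i][0], short[i + 1][1])
            for i in range(0, len(short) - 1, 2)]
    return short, long


def cobble(short_texts, long_texts):
    """Join per-slice text segments into one stream of text.

    short_texts: list of text segments, one per short slice, in order.
    long_texts: dict mapping the index of the first short slice covered by a
                long slice to the text returned for that long slice.
    Assumed rule: long-slice text replaces the two short segments it covers.
    """
    out = []
    i = 0
    while i < len(short_texts):
        if i in long_texts and i + 1 < len(short_texts):
            out.append(long_texts[i])
            i += 2
        else:
            out.append(short_texts[i])
            i += 1
    return " ".join(t for t in out if t)
```

For example, make_slices(5000, 2000) yields three short slices and one long slice covering the first two, and cobble falls back to short-slice text wherever no long-slice text has been returned.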

In some cases the ASR server generates ASR error corrections for the ASR generated text, the relay further programmed to perform the steps of receiving ASR error corrections, using the error corrections to automatically correct at least some of the errors in the ASR generated text on the display screen and transmitting the ASR error corrections to the AU device. In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display, the processor further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases, after a CA makes a change to ASR generated text, the text prior thereto becomes firm so that no ASR error corrections are made to the text subsequent thereto.

In some cases the relay further includes a speaker and wherein the processor broadcasts the HU voice signal to the CA via the speaker as the ASR generated text is presented on the display screen. In some cases the processor aligns broadcast of the HU voice signal with ASR generated text presented on the display screen. In some cases the processor presents the ASR generated text on the display screen immediately upon reception and transmits the ASR generated text immediately upon reception and broadcasts the HU voice signal under control of the CA using an interface. In some cases, as a word in the HU voice signal is broadcast to the CA, text corresponding to the broadcast word on the display screen is visually distinguished from other text on the display screen.

Other embodiments include a relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, an interface device, a processor linked to the display screen and the interface device, the processor programmed to perform the steps of receiving the HU voice signal from the AU device, separating the HU voice signal into voice signal slices, separately transmitting the HU voice signal slices to a remote automatic speech recognition (ASR) server that is located at a remote location from the relay, receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text, presenting the stream of ASR generated text as it is received from the ASR server for viewing by a call assistant (CA) via the display and transmitting the stream of ASR generated text to the AU device as the stream is received from the relay.

In some cases ASR error corrections to the ASR generated text are received from the ASR server and at least some of the ASR error corrections are used to correct the text on the display, the relay receives CA error corrections to the text on the display and uses those corrections to correct text on the display. In some cases, once a CA corrects an error in the text on the display, ASR error corrections for text prior to the CA corrected text on the display are not used to make error corrections on the display. In some cases all ASR generated text presented on the display is transmitted to the AU device and all ASR error corrections and CA text corrections that are presented on the display are transmitted as correction text to the AU device.
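
The interaction between ASR error corrections and CA corrections described above can be sketched with a small caption buffer in which text up to the latest CA correction becomes firm. The class and method names are hypothetical, and treating the CA corrected word itself as firm is an assumption made for this illustration.

```python
class CaptionBuffer:
    """Caption words plus a marker for how much text is firm (sketch only)."""

    def __init__(self):
        self.words = []
        self.firm_upto = 0  # words at indexes below this are firm

    def append_asr(self, words):
        """Add newly received ASR generated words to the end of the text."""
        self.words.extend(words)

    def apply_ca_correction(self, index, word):
        """Apply a CA correction and firm up all text through that word."""
        self.words[index] = word
        self.firm_upto = max(self.firm_upto, index + 1)

    def apply_asr_correction(self, index, word):
        """Apply an ASR error correction unless it targets firm text."""
        if index < self.firm_upto:
            return False  # correction ignored; firm text is left as-is
        self.words[index] = word
        return True
```

In this sketch only corrections that return true, along with every CA correction, would be forwarded to the AU device as correction text.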

Some embodiments include a caption device for use by a hard of hearing assisted user (AU) to assist the AU during voice communications with a hearing user (HU) using an HU device, the caption device comprising a display screen, a memory, at least one communication link element for linking to a communication network, a speaker, a processor linked to each of the display screen, the memory, the speaker and the communication link, the processor programmed to perform the steps of receiving an HU voice signal from the HU device during a call, broadcasting the HU voice signal to the AU via the speaker, storing at least a most recent portion of the HU voice signal in the memory, receiving a command from the AU to start a captioning session, upon receiving the command, obtaining a text caption corresponding to the stored HU voice signal and presenting the text caption to the AU via the display.

In some cases the step of obtaining a text caption includes initiating a process whereby an automated speech recognition (ASR) program converts the stored HU voice signal to text. In some cases the processor runs the ASR program. In some cases the step of initiating the process includes establishing a link to a remote relay, and transmitting the stored HU voice signal to the relay, the step of obtaining further including receiving the text caption from the relay. In at least some embodiments the relay further includes, subsequent to receiving the command, obtaining text captions for additional HU voice signals received during the ongoing call. In some cases the step of obtaining a text caption of the stored HU voice signal includes initiating a process whereby the HU voice signal is converted to text via an automatic speech recognition (ASR) engine and wherein the step of obtaining text captions for additional HU voice signals received during the ongoing call further includes transmitting the additional HU voice signals to a relay and receiving text captions back from the relay.

To the accomplishment of the foregoing and related ends, the disclosure, then, comprises the features hereinafter fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the disclosure. However, these aspects are indicative of but a few of the various ways in which the principles of the invention can be employed. Other aspects, advantages and novel features of the disclosure will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a schematic showing various components of a communication system including a relay that may be used to perform various processes and methods according to at least some aspects of the present invention;

FIG. 2 is a schematic of the relay server shown in FIG. 1;

FIG. 3 is a flow chart showing a process whereby an automated voice-to-text engine is used to generate automated text in parallel with a CA generating text where the automated text is used instead of CA generated text to provide captioning to an AU's device once an accuracy threshold has been exceeded;

FIG. 4 is a sub-process that may be substituted for a portion of the process shown in FIG. 3 whereby a call assistant can determine whether or not the automated text takes over the process after the accuracy threshold has been achieved;

FIG. 5 is a sub-process that may be added to the process shown in FIG. 3 wherein, upon an AU's requesting help, a call is linked to a second CA for correcting the automated text;

FIG. 6 is a process whereby an automated voice-to-text engine is used to fill in text for a HU's voice messages that are skipped over by a CA when an AU requests instantaneous captioning of a current message;

FIG. 7 is a process whereby automated text is automatically used to fill in captioning when transcription by a CA lags behind a HU's voice messages by a threshold duration;

FIG. 8 is a flow chart illustrating a process whereby text is generated for a HU's voice messages that precede a request for captioning services;

FIG. 9 is a flow chart illustrating a process whereby voice messages prior to a request for captioning service are automatically transcribed to text by an automated voice-to-text engine;

FIG. 10 is a flow chart illustrating a process whereby an AU's device processor performs transcription processes until a request for captioning is received at which point the AU's device presents texts related to HU voice messages prior to the request and ongoing voice messages are transcribed via a relay;

FIG. 11 is a flow chart illustrating a process whereby an AU's device processor generates automated text for a hearing user's voice messages which is presented via a display to an AU and also transmits the text to a CA at a relay for correction purposes;

FIG. 12 is a flow chart illustrating a process whereby high definition digital voice messages and analog voice messages are handled differently at a relay;

FIG. 13 is a process similar to FIG. 12, albeit where an AU also has the option to link to a CA for captioning service regardless of the type of voice message received;

FIG. 14 is a flow chart that may be substituted for a portion of the process shown in FIG. 3 whereby voice models and voice profiles are generated for frequent HUs that communicate with an AU where the models and profiles can be subsequently used to increase accuracy of a transcription process;

FIG. 15 is a flow chart illustrating a process similar to the sub-process shown in FIG. 14 where voice profiles and voice models are generated and stored for subsequent use during transcription;

FIG. 16 is a flow chart illustrating a sub-process that may be added to the process shown in FIG. 15 where the resulting process calls for training of a voice model at each of an AU's device and a relay;

FIG. 17 is a schematic illustrating a screen shot that may be presented via an AU's device display screen;

FIG. 18 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 19 is a process that may be performed by the system shown in FIG. 1 where automated text is generated for line check words and is presented to an AU immediately upon identification of the words;

FIG. 20 is similar to FIG. 17, albeit showing a different screen shot;

FIG. 21 is a flow chart illustrating a method whereby an automated voice-to-text engine is used to identify errors in CA generated text which can be highlighted and can be corrected by a CA;

FIG. 22 is an exemplary AU device display screen shot that illustrates visually distinct text to indicate non-textual characteristics of an HU voice signal to an AU;

FIG. 23 is an exemplary CA workstation display screen shot that shows how automated ASR text associated with an instantaneously broadcast word may be visually distinguished for an error correcting CA;

FIG. 23A is a screen shot of a CA interface providing an option to switch from ASR generated text to a full CA system where a CA generates caption text;

FIG. 24 shows an exemplary HU communication device with CA captioned HU text and ASR generated AU text presented as well as other communication information that is consistent with at least some aspects of the present disclosure;

FIG. 25 is an exemplary CA workstation display screen shot similar to FIG. 23, albeit where a CA has corrected an error and an HU voice signal playback has been skipped backward as a function of where the correction occurred;

FIG. 26 is a screen shot of an exemplary AU device display that presents CA captioned HU text as well as ASR engine generated AU text;

FIG. 27 is an illustration of an exemplary HU device that shows text corresponding to the HU's voice signal as well as an indication of which word in the text has been most recently presented to an AU;

FIG. 28 is a schematic diagram showing a relay captioning system that is consistent with at least some aspects of the present disclosure;

FIG. 29 is a schematic diagram of a relay system that includes a text transcription quality assessment function that is consistent with at least some aspects of the present disclosure;

FIG. 30 is similar to FIG. 29, albeit showing a different relay system that includes a different quality assessment function;

FIG. 31 is similar to FIG. 29, albeit showing a third relay system that includes a third quality assessment function;

FIG. 32 is a flow chart illustrating a method whereby time stamps are assigned to HU voice segments which are then used to substantially synchronize text and voice presentation;

FIG. 33 is a schematic illustrating a caption relay system that may implement the method illustrated in FIG. 32 as well as other methods described herein;

FIG. 34 is a sub process that may be substituted for a portion of theFIG. 32 process where an Au device assigns a sequence of time stamps toa sequence of text segments;

FIG. 35 is another flow chart illustrating another method for assigningand using time stamps to synchronize text and HU voice broadcast;

FIG. 36 is a screen shot illustrating a CA interface where a prior wordis selected to be rebroadcast;

FIG. 37 is a screen shot similar to FIG. 36 , albeit of an Au devicedisplay showing an AU selecting a prior broadcast phrase forrebroadcast;

FIG. 38 is another sub process that may be substituted for a portion ofthe FIG. 32 method;

FIG. 39 is a screen shot showing a CA interface where various inventivefeatures are shown;

FIG. 40 is a screen shot illustrating another CA interface where low andhigh confidence text is presented in different columns to help a CA moreeasily distinguish between text likely to need correction and text thatis less likely to need correction;

FIG. 40A is a screen shot of a CA interface showing low confidencecaption text visually distinguished from other text presented to a CAfor correction consideration, among other things;

FIG. 41 is a flow chart illustrating a method of introducing errors in ASR generated text to test CA attention;

FIG. 42 is a screen shot illustrating an AU interface including, in addition to text presentation, an HU video field and a CA signing field that is consistent with at least some aspects of the present disclosure;

FIG. 43 is a screen shot illustrating yet another CA interface;

FIG. 44 is another AU interface screen shot including scrolling text and an HU video window;

FIG. 45 is another CA interface screen shot showing a CA correction field, an ASR uncorrected text field and an intervening time field that is consistent with at least some aspects of the present disclosure;

FIG. 46 is a schematic illustrating different phrase slices that may be formed that is consistent with at least some aspects of the present disclosure;

FIG. 47 is a screen shot illustrating an interface presented to a CA that includes various transcription feedback tools that are consistent with various aspects of the present disclosure;

FIG. 48 is a screen shot illustrating an interface presented to an AU that indicates a transition from automated text to CA generated text that is consistent with at least some aspects of the present disclosure;

FIG. 49 is similar to FIG. 48, albeit illustrating an interface that indicates a transition from automated text to CA corrected text that is consistent with at least some aspects of the present disclosure;

FIG. 50 is a screen shot showing a CA interface that, among other things, enables a CA to select specific points in ASR generated text to firm up prior ASR generated text;

FIG. 51 is a screen shot illustrating an administrator's interface that shows results of CA generated text and scoring tools used to assess quality of captions generated by a CA;

FIG. 52 is a screen shot illustrating a CA interface where a CA is restricted to editing text within a small field of recent text to ensure that the CA keeps up with current HU voice utterances within some window of time;

FIG. 53 is similar to FIG. 52, albeit showing the interface at a different point in time;

FIG. 54 is a top plan view of a CA workstation including an eye tracking camera that is consistent with at least some aspects of some embodiments of the present disclosure;

FIG. 55 is a schematic illustrating an exemplary CA screen shot and a camera that tracks a CA's eyes that is consistent with at least some aspects of some embodiments of the present disclosure;

FIG. 56 is a screen shot showing an AU interface where a first error correction is shown distinguished in multiple ways;

FIG. 57 is a screen shot similar to FIG. 56, albeit where the first error correction is shown in a less noticeable way and a second error correction is shown distinguished in multiple ways so that the distinguishing effect related to the first error correction appears to be extinguishing; and

FIG. 58 is similar to FIGS. 56 and 57, albeit showing the interface after a third error correction is presented where the first error correction is now shown as normal text, the second is shown distinguished in an extinguishing fashion and the third error correction is fully distinguished.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.

DETAILED DESCRIPTION OF THE DISCLOSURE

The various aspects of the subject disclosure are now described with reference to the annexed drawings, wherein like reference numerals correspond to similar elements throughout the several views. It should be understood, however, that the drawings and detailed description hereafter relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.

As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers or processors.

The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

Furthermore, the disclosed subject matter may be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer or processor based device to implement aspects detailed herein. The term “article of manufacture” (or alternatively, “computer program product”) as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, solid state drives and flash memory devices (e.g., card, stick). Additionally it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.

Unless indicated otherwise, the phrases “assisted user”, “hearing user” and “call assistant” will be represented by the acronyms “AU”, “HU” and “CA”, respectively. The acronym “ASR” will be used to abbreviate the phrase “automatic speech recognition”. Unless indicated otherwise, the phrase “full CA mode” will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session wherein a voice signal is listened to by a live CA (e.g., a person) who transcribes the voice message to text which the CA then corrects where the CA generated text is presented to at least one of the communicants to the communication session, and the phrase “ASR-CA backed up mode” will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session where a voice signal is fed to an ASR software engine (e.g., a computer running software) that generates at least initial captions for the received voice signal and where a CA corrects the original captions where the ASR generated captions and in at least some cases the CA generated corrections are presented to at least one of the communicants to the communication session.

System Architecture

Referring now to the drawings wherein like reference numerals correspond to similar elements throughout the several views and, more specifically, referring to FIG. 1, the present disclosure will be described in the context of an exemplary communication system 10 including an AU's communication device 12, an HU's telephone or other type communication device 14, and a relay 16. The AU's device 12 is linked to the HU's device 14 via any network connection capable of facilitating a voice call between the AU and the HU. For instance, the link may be a conventional telephone line, a network connection such as an internet connection or other network connection, a wireless connection, etc. AU device 12 includes a keyboard 20, a display screen 18 and a handset 22. Keyboard 20 can be used to dial any telephone number to initiate a call and, in at least some cases, includes other keys or may be controlled to present virtual buttons via screen 18 for controlling various functions that will be described in greater detail below. Other identifiers such as IP addresses or the like may also be used in at least some cases to initiate a call. Screen 18 includes a flat panel display screen for displaying, among other things, text transcribed from a voice message or signal generated using HU's device 14, control icons or buttons, caption feedback signals, etc. Handset 22 includes a speaker for broadcasting a HU's voice messages to an AU and a microphone for receiving a voice message from an AU for delivery to the HU's device 14. AU device 12 may also include a second loud speaker so that device 12 can operate as a speaker phone type device. Although not shown, device 12 further includes a processor and a memory for storing software run by the processor to perform various functions that are consistent with at least some aspects of the present disclosure. Device 12 is also linked or is linkable to relay 16 via any communication network including a phone network, a wireless network, the internet or some other similar network, etc. Device 12 may further include a Bluetooth or other type of transmitter for linking to an AU's hearing aid or some other speaker type device.

HU's device 14, in at least some embodiments, includes a communication device (e.g., a telephone) including a keyboard for dialing phone numbers and a handset including a speaker and a microphone for communication with other devices. In other embodiments device 14 may include a computer, a smart phone, a smart tablet, etc., that can facilitate audio communications with other devices. Devices 12 and 14 may use any of several different communication protocols including analog or digital protocols, a VOIP protocol or others.

Referring still to FIG. 1, relay 16 includes, among other things, a relay server 30 and a plurality of CA work stations 32, 34, etc. Each of the CA work stations 32, 34, etc., is similar and operates in a similar fashion and therefore only station 32 is described here in any detail. Station 32 includes a display screen 50, a keyboard 52 and a headphone/microphone headset 54. Screen 50 may be any type of electronic display screen for presenting information including text transcribed from a HU's voice signal or message. In most cases screen 50 will present a graphical user interface with on screen tools for editing text that appears on the screen. One text editing system is described in U.S. Pat. No. 7,164,753 which issued on Jan. 16, 2007 which is titled “Real Time Transcription Correction System” and which is incorporated herein in its entirety.

Keyboard 52 is a standard text entry QWERTY type keyboard and can be used to type text or to correct text presented on display screen 50. Headset 54 includes a speaker in an ear piece and a microphone in a mouth piece and is worn by a CA. The headset enables a CA to listen to the voice of a HU and the microphone enables the CA to speak voice messages into the relay system such as, for instance, revoiced messages from a HU to be transcribed into text. For instance, typically during a call between a HU on device 14 and an AU on device 12, the HU's voice messages are presented to a CA via headset 54 and the CA revoices the messages into the relay system using headset 54. Software trained to the voice of the CA transcribes the assistant's voice messages into text which is presented on display screen 50. The CA then uses keyboard 52 and/or headset 54 to make corrections to the text on display 50. The corrected text is then transmitted to the AU's device 12 for display on screen 18. In the alternative, the text may be transmitted prior to correction to the AU's device 12 for display and corrections may be subsequently transmitted to correct the displayed text via in-line corrections where errors are replaced by corrected text.

Although not shown, CA work station 32 may also include a foot pedal or other device for controlling the speed with which voice messages are played via headset 54 so that the CA can slow or even stop play of the messages while the assistant either catches up on transcription or correction of text.

Referring still to FIG. 1 and also to FIG. 2, server 30 is a computer system that includes, among other components, at least a first processor 56 linked to a memory or database 58 where software run by processor 56 to facilitate various functions that are consistent with at least some aspects of the present disclosure is stored. The software stored in memory 58 includes pre-trained CA voice-to-text transcription software 60 for each CA where CA specific software is trained to the voice of an associated CA thereby increasing the accuracy of transcription activities. For instance, Naturally Speaking continuous speech recognition software by Dragon, Inc. may be pre-trained to the voice of a specific CA and then used to transcribe voice messages voiced by the CA into text.

In addition to the CA trained software, a voice-to-text software program 62 that is not pre-trained to a CA's voice and instead that trains to any voice on the fly as voice messages are received is stored in memory 58. Again, Naturally Speaking software that can train on the fly may be used for this purpose. Hereinafter, the automatic speech recognition software or system that trains to the HU voices will be referred to generally as an ASR engine at times.

Moreover, software 64 that automatically performs one of several different types of triage processes to generate text from voice messages accurately, quickly and in a relatively cost effective manner is stored in memory 58. The triage programs are described in detail hereafter.

One issue with existing relay systems is that each call is relatively expensive to facilitate. To this end, in order to meet required accuracy standards for text caption calls, each call requires a dedicated CA. While automated voice-to-text systems that would not require a CA have been contemplated, none has been successfully implemented because of accuracy and speed problems.

Basic Semi-Automated System

One aspect of the present disclosure is related to a system that is semi-automated wherein a CA is used when accuracy of an automated system is not at required levels and the assistant is cut out of a call automatically or manually when accuracy of the automated system meets or exceeds accuracy standards or at the preference of an AU. For instance, in at least some cases a CA will be assigned to every new call linked to a relay and the CA will transcribe voice-to-text as in an existing system. Here, however, the difference will be that, during the call, the voice of a HU will also be processed by server 30 to automatically transcribe the HU's voice messages to text (e.g., into “automated text”). Server 30 compares corrected text generated by the CA to the automated text to identify errors in the automated text. Server 30 uses identified errors to train the automated voice-to-text software to the voice of the HU. During the beginning of the call the software trains to the HU's voice and accuracy increases over time as the software trains. At some point the accuracy increases until required accuracy standards are met. Once accuracy standards are met, server 30 is programmed to automatically cut out the CA and start transmitting the automated text to the AU's device 12.

In at least some cases, when a CA is cut out of a call, the system may provide a “Help” button, an “Assist” button or “Assistance Request” type button (see 68 in FIG. 1) to an AU so that, if the AU recognizes that the automated text has too many errors for some reason, the AU can request a link to a CA to increase transcription accuracy (e.g., generate an assistance request). In some cases the help button may be a persistent mechanical button on the AU's device 12. In the alternative, the help button may be a virtual on screen icon (e.g., see 68 in FIG. 1) and screen 18 may be a touch sensitive screen so that contact with the virtual button can be sensed. Where the help button is virtual, the button may only be presented after the system switches from providing CA generated text to an AU's device to providing automated text to the AU's device to avoid confusion (e.g., avoid a case where an AU is already receiving CA generated text but thinks, because of a help button, that even better accuracy can be achieved in some fashion). Thus, while CA generated text is displayed on an AU's device 12, no “help” button is presented and after automated text is presented, the “help” button is presented. After the help button is selected and a CA is re-linked to the call, the help button is again removed from the AU's device display 18 to avoid confusion.

Referring now to FIGS. 2 and 3, a method or process 70 is illustrated that may be performed by server 30 to cut out a CA when automated text reaches an accuracy level that meets a standard threshold level. Referring also to FIG. 1, at block 72, help and auto flags are each set to a zero value. The help flag indicates that an AU has selected a help or assist button via the AU's device 12 because of a perception that too many errors are occurring in transcribed text. The auto flag indicates that automated text accuracy has exceeded a standard threshold requirement. Zero values indicate that the help button has not been selected and that the standard requirement has yet to be met and one values indicate that the button has been selected and that the standard requirement has been met.

Referring still to FIGS. 1 and 3, at block 74, during a phone call between a HU using device 14 and an AU using device 12, the HU's voice messages are transmitted to server 30 at relay 16. Upon receiving the HU's voice messages, server 30 checks the auto and help flags at blocks 76 and 84, respectively. At least initially the auto flag will be set to zero at block 76 meaning that automated text has not reached the accuracy standard requirement and therefore control passes down to block 78 where the HU's voice messages are provided to a CA. At block 80, the CA listens to the HU's voice messages and generates text corresponding thereto by either typing the messages, revoicing the messages to voice-to-text transcription software trained to the CA's voice, or a combination of both. Text generated is presented on screen 50 and the CA makes corrections to the text using keyboard 52 and/or headset 54 at block 80. At block 82 the CA generated text is transmitted to AU device 12 to be displayed for the AU on screen 18.

Referring again to FIGS. 1 and 3, at block 84, at least initially the help flag will be set to zero indicating that the AU has not requested additional captioning assistance. In fact, at least initially the “help” button 68 may not be presented to an AU as CA generated text is initially presented. Where the help flag is zero at block 84, control passes to block 86 where the HU's voice messages are fed to voice-to-text software run by server 30 that has not been previously trained to any particular voice. At block 88 the software automatically converts the HU's voice to text, generating automated text. At block 90, server 30 compares the CA generated text to the automated text to identify errors in the automated text. At block 92, server 30 uses the errors to train the voice-to-text software for the HU's voice. In this regard, for instance, where an error is identified, server 30 modifies the software so that the next time the utterance that resulted in the error occurs, the software will generate the word or words that the CA generated for the utterance. Other ways of altering or training the voice-to-text software are well known in the art and any way of training the software may be used at block 92.
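By way of illustration only, the comparison performed at block 90 may be sketched in Python as a word level alignment of the two transcripts. The function name find_errors and the use of the standard difflib module are assumptions made for this sketch and are not taken from the disclosure; any other alignment technique could be substituted.

```python
from difflib import SequenceMatcher


def find_errors(ca_text, automated_text):
    """Return (automated_words, ca_words) pairs where the transcripts differ.

    Hypothetical helper: the CA generated text is treated as the reference
    and each mismatched span in the automated text is reported as an error.
    """
    ca_words = ca_text.lower().split()
    auto_words = automated_text.lower().split()
    matcher = SequenceMatcher(a=auto_words, b=ca_words, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Words the engine produced versus words the CA produced.
            errors.append((auto_words[i1:i2], ca_words[j1:j2]))
    return errors
```

Each returned pair identifies the word or words the automated software generated and the word or words the CA generated for the same utterance, which is the information described above as being used at block 92.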

After block 92 control passes to block 94 where server 30 monitors for a selection of the “help” button 68 by the AU. If the help button has not been selected, control passes to block 96 where server 30 compares the accuracy of the automated text to a threshold standard accuracy requirement. For instance, the standard requirement may require that accuracy be greater than 96% measured over at least a most recent forty-five second period or a most recent 100 words uttered by a HU, whichever is longer. Where accuracy is below the threshold requirement, control passes back up to block 74 where the process described above continues. At block 96, once the accuracy is greater than the threshold requirement, control passes to block 98 where the auto flag is set to one indicating that the system should start using the automated text and delink the CA from the call to free up the assistant to handle a different call. A virtual “help” button may also be presented via the AU's display 18 at this time. Next, at block 100, the CA is delinked from the call and at block 102 the processor generated automated text is transmitted to the AU device to be presented on display screen 18.
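The accuracy test at block 96 may be sketched as follows. This is a minimal sketch under stated assumptions: each automated word is assumed to carry a time stamp and a flag indicating whether it matched the CA generated text, “whichever is longer” is interpreted here as whichever window contains more words, and the names recent_accuracy and meets_standard are hypothetical.

```python
def recent_accuracy(words, now, min_seconds=45.0, min_words=100):
    """words: chronological list of (timestamp_seconds, is_correct) tuples."""
    by_time = [w for w in words if now - w[0] <= min_seconds]
    by_count = words[-min_words:]
    # Assumption: "whichever is longer" means the window with more words.
    window = by_time if len(by_time) >= len(by_count) else by_count
    if not window:
        return None
    return sum(1 for _, ok in window if ok) / len(window)


def meets_standard(words, now, threshold=0.96):
    """True once accuracy over the chosen window exceeds the threshold."""
    accuracy = recent_accuracy(words, now)
    return accuracy is not None and accuracy > threshold
```

For a fast talker the forty-five second window will usually hold more than 100 words and is therefore used; for a slow talker the 100 word window is used instead.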

Referring again to block 74, the HU's voice is continually received during a call and at block 76, once the auto flag has been set to one, the lower portion of the left hand loop including blocks 78, 80 and 82 is cut out of the process as control loops back up to block 74.

Referring again to block 94, if, during an automated portion of a call when automated text is being presented to the AU, the AU decides that there are too many errors in the transcription presented via display 18 and the AU selects the “help” button 68 (see again FIG. 1), control passes to block 104 where the help flag is set to one indicating that the AU has requested the assistance of a CA and the auto flag is reset to zero indicating that CA generated text will be used to drive the AU's display 18 instead of the automated text. Thereafter control passes back up to block 74. Again, at block 76, with the auto flag set to zero the next time through decision block 76, control passes back down to block 78 where the call is again linked to a CA for transcription as described above. In addition, the next time through block 84, because the help flag is set to one, control passes back up to block 74 and the automated text loop including blocks 86 through 104 is effectively cut out of the rest of the call.
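The interaction of the help and auto flags described above may be summarized in a short Python sketch. The class TriageState and the function step are hypothetical names introduced for illustration; the sketch models only the flag logic of blocks 84, 94, 96, 98 and 104 and omits audio handling, training and CA linking.

```python
from dataclasses import dataclass


@dataclass
class TriageState:
    help_flag: int = 0  # 1 once the AU has selected the help button
    auto_flag: int = 0  # 1 once automated text exceeds the standard


def step(state, accuracy, help_pressed, threshold=0.96):
    """One pass through the loop; returns the text source for the AU display."""
    if state.help_flag == 0:          # block 84: automated loop still active
        if help_pressed:              # block 94 to block 104
            state.help_flag, state.auto_flag = 1, 0
        elif accuracy > threshold:    # block 96 to block 98
            state.auto_flag = 1
    return "automated" if state.auto_flag == 1 else "CA"
```

Once the help flag has been set to one, further passes leave both flags unchanged, which mirrors the automated text loop being cut out of the rest of the call.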

In at least some embodiments, there will be a short delay (e.g., 5 to 10 seconds in most cases) between setting the flags at block 104 and stopping use of the automated text so that a new CA can be linked up to the call and start generating CA generated text prior to halting the automated text. In these cases, until the CA is linked and generating text for at least a few seconds (e.g., 3 seconds), the automated text will still be used to drive the AU's display 18. The delay may either be a pre-defined delay or may have a case specific duration that is determined by server 30 monitoring CA generated text and switching over to the CA generated text once the CA is up to speed.

In some embodiments, prior to delinking a CA from a call at block 100, server 30 may store a CA identifier along with a call identifier for the call. Thereafter, if an AU requests help at block 94, server 30 may be programmed to identify if the CA previously associated with the call is available (e.g., not handling another call) and, if so, may re-link to the CA at block 78. In this manner, if possible, a CA that has at least some context for the call can be linked up to restart transcription services.
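A minimal sketch of the re-link preference follows. The dictionary prior_ca_by_call, the set available_cas and the fallback of choosing the first available CA in sorted order are assumptions made for illustration; the disclosure does not specify how a substitute CA is chosen.

```python
def pick_ca(call_id, prior_ca_by_call, available_cas):
    """Prefer the CA previously linked to this call, if that CA is free."""
    prior = prior_ca_by_call.get(call_id)
    if prior is not None and prior in available_cas:
        return prior
    # Assumed fallback: any available CA, chosen deterministically.
    return next(iter(sorted(available_cas)), None)
```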

In some embodiments it is contemplated that after an AU has selected a help button to receive call assistance, the call will be completed with a CA on the line. In other cases it is contemplated that server 30 may, when a CA is re-linked to a call, start a second triage process to attempt to delink the CA a second time if a threshold accuracy level is again achieved. For instance, in some cases, midstream during a call, a second HU may start communicating with the AU via the HU's device. For instance, a child may yield the HU's device 14 to a grandchild that has a different voice profile causing the AU to request help from a CA because of perceived text errors. Here, after the hand back to the CA, server 30 may start training on the grandchild's voice and may eventually achieve the threshold level required. Once the threshold again occurs, the CA may be delinked a second time so that automated text is again fed to the AU's device.

As another example, text errors in automated text may be caused by temporary noise in one or more of the lines carrying the HU's voice messages to relay 16. Here, once the noise clears up, automated text may again be a suitable option. Thus, here, after an AU requests CA help, the triage process may again commence and if the threshold accuracy level is again exceeded, the CA may be delinked and the automated text may again be used to drive the AU's device 12. While the threshold accuracy level may be the same each time through the triage process, in at least some embodiments the accuracy level may be changed each time through the process. For instance, the first time through the triage process the accuracy threshold may be 96%. The second time through the triage process the accuracy threshold may be raised to 98%.

In at least some embodiments, when the automated text accuracy exceeds the standard accuracy threshold, there may be a short transition time during which a CA on a call observes automated text while listening to a HU's voice message to manually confirm that the handover from CA generated text to automated text is smooth. During this short transition time, for instance, the CA may watch the automated text on her workstation screen 50 and may correct any errors that occur during the transition. In at least some cases, if the CA perceives that the handoff does not work or the quality of the automated text is poor for some reason, the CA may opt to retake control of the transcription process.

One sub-process 120 that may be added to the process shown in FIG. 3 for managing a CA to automated text handoff is illustrated in FIG. 4. Referring also to FIGS. 1 and 2, at block 96 in FIG. 3, if the accuracy of the automated text exceeds the accuracy standard threshold level, control may pass to block 122 in FIG. 4. At block 122, a short duration transition timer (e.g., 10-15 seconds) is started. At block 124 automated text (e.g., text generated by feeding the HU's voice messages directly to voice-to-text software) is presented on the CA's display 50. At block 126 an on screen “Retain Control” icon or virtual button is provided to the CA via the assistant's display screen 50 which can be selected by the CA to forego the handoff to the automated voice-to-text software. At block 128, if the “Retain Control” icon is selected, control passes to block 132 where the help flag is set to one and then control passes back up to block 76 in FIG. 3 where the CA process for generating text continues as described above. At block 128, if the CA does not select the “Retain Control” icon, control passes to block 130 where the transition timer is checked. If the transition timer has not timed out control passes back up to block 124. Once the timer times out at block 130, control passes back to block 98 in FIG. 3 where the auto flag is set to one and the CA is delinked from the call.
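The handoff sub-process of FIG. 4 may be sketched as a polling loop. To keep the sketch self-contained and testable, elapsed time is simulated in fixed ticks instead of being read from a clock, and the callable retain_control_pressed stands in for the “Retain Control” icon; both are assumptions made for illustration.

```python
def run_transition(retain_control_pressed, duration=10.0, tick=1.0):
    """Simulate the transition timer; returns the resulting flag values.

    retain_control_pressed: callable taking elapsed seconds and returning
    True if the CA has selected the "Retain Control" icon by that time.
    """
    elapsed = 0.0
    while elapsed < duration:                 # block 130: timer still running
        if retain_control_pressed(elapsed):   # block 128
            return {"help": 1, "auto": 0}     # block 132: CA keeps the call
        elapsed += tick                       # block 124 repeats each tick
    return {"help": 0, "auto": 1}             # block 98: CA is delinked
```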

In at least some embodiments it is contemplated that after voice-to-text software takes over the transcription task and the CA is delinked from a call, server 30 itself may be programmed to sense when transcription accuracy has degraded substantially and the server 30 may cause a re-link to a CA to increase accuracy of the text transcription. For instance, server 30 may assign a confidence factor to each word in the automated text based on how confident the server is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, server 30 may re-link to a CA to increase transcription accuracy. The automated process for re-linking to a CA may be used instead of or in addition to the process described above whereby an AU selects the “help” button to re-link to a CA.
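A rolling average of per word confidence factors may be sketched as follows. The class name ConfidenceMonitor, the 0.85 threshold and the choice to wait for a full window before requesting a re-link are assumptions made for this sketch; the disclosure states only that an average over a recent number of words or a recent period is compared to a threshold level.

```python
from collections import deque


class ConfidenceMonitor:
    """Tracks the average confidence over the most recent `window` words."""

    def __init__(self, window=100, threshold=0.85):
        self.scores = deque(maxlen=window)  # oldest scores drop off
        self.threshold = threshold

    def add_word(self, confidence):
        """Record one word's confidence; return True if a CA is needed."""
        self.scores.append(confidence)
        return self.needs_ca()

    def needs_ca(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # assumption: no decision until the window fills
        return sum(self.scores) / len(self.scores) < self.threshold
```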

In at least some cases when an AU selects a “help” button to re-link to a CA, partial call assistance may be provided instead of full CA service. For instance, instead of adding a CA that transcribes a HU's voice messages and then corrects errors, a CA may be linked only for correction purposes. The idea here is that while software trained to a HU's voice may generate some errors, the number of errors after training will still be relatively small in most cases even if objectionable to an AU. In at least some cases CAs may be trained to have different skill sets where highly skilled and relatively more expensive to retain CAs are trained to re-voice HU voice messages and correct the resulting text and less skilled CAs are trained to simply make corrections to automated text. Here, initially all calls may be routed to highly skilled revoicing or “transcribing” CAs and all re-linked calls may be routed to less skilled “corrector” CAs.

A sub-process 134 that may be added to the process of FIG. 3 for routing re-linked calls to a corrector CA is shown in FIG. 5. Referring also to FIGS. 1 and 3, at decision block 94, if an AU selects the help button, control may pass to block 136 in FIG. 5 where the call is linked to a second corrector CA. At block 138 the automated text is presented to the second CA via the CA's display 50. At block 140 the second CA listens to the voice of the HU and observes the automated text and makes corrections to errors perceived in the text. At block 142, server 30 transmits the corrected automated text to the AU's device for display via screen 18. After block 142 control passes back up to block 76 in FIG. 3.
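The routing rule described above, in which new calls go to transcribing CAs and re-linked calls go to corrector CAs, may be sketched as follows. The function name route_call and the representation of each CA pool as a simple list are hypothetical.

```python
def route_call(is_relink, transcribers, correctors):
    """Take the next free CA from the pool that matches the call type."""
    pool = correctors if is_relink else transcribers
    # pop(0) removes the CA from the pool so the CA is not double booked.
    return pool.pop(0) if pool else None
```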

Re-Sync and Fill in Text

In some cases where a CA generates text that drives an AU's display screen 18 (see again FIG. 1), for one reason or another the CA's transcription to text may fall behind the HU's voice message stream by a substantial amount. For instance, where a HU is speaking quickly, is using odd vocabulary, and/or has an unusual accent that is hard to understand, CA transcription may fall behind a voice message stream by 20 seconds, 40 seconds or more.

In many cases when captioning falls behind, an AU can perceive that presented text has fallen far behind broadcast voice messages from a HU based on memory of recently broadcast voice message content and observed text. For instance, an AU may recognize that currently displayed text corresponds to a portion of the broadcast voice message that occurred thirty seconds ago. In other cases some captioning delay indicator may be presented via an AU's device display 18. For instance, see FIG. 17 where captioning delay is indicated in two different ways on a display screen 18. First, text 212 indicates an estimated delay in seconds (e.g., 24 second delay). Second, at the end of already transcribed text 214, blanks 216 for words already voiced but yet to be transcribed may be presented to give an AU a sense of how delayed the captioning process has become.
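The two delay indications may be sketched as a single formatting function. The sketch assumes that a count of words already voiced is available (for instance, from an ASR engine running in parallel) and that the time at which the last captioned word was spoken is known; the function name delay_indicator and the use of underscores for blanks are illustrative choices only.

```python
def delay_indicator(transcribed_text, voiced_word_count, last_caption_time, now):
    """Build a caption line with an estimated delay and blanks for pending words."""
    shown = len(transcribed_text.split())
    pending = max(voiced_word_count - shown, 0)   # words voiced, not yet shown
    delay = max(int(round(now - last_caption_time)), 0)
    blanks = " ".join(["____"] * pending)
    return f"({delay} second delay) {transcribed_text} {blanks}".rstrip()
```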

When an AU perceives that captioning is too far behind or when the user cannot understand a recently broadcast voice message, the AU may want the text captioning to skip ahead to the currently broadcast voice message. For instance, if an AU had difficulty hearing the most recent five seconds of a HU's voice message and continues to have difficulty hearing but generally understood the preceding 25 seconds, the AU may want the captioning process to be re-synced with the current HU's voice message so that the AU's understanding of current words is accurate.

Here, however, because the AU could not understand the most recent 5 seconds of broadcast voice message, a re-sync with the current voice message would leave the AU with at least some void in understanding the conversation (e.g., at least the most recent 5 seconds of misunderstood voice message would be lost). To deal with this issue, in at least some embodiments, it is contemplated that server 30 may run automated voice-to-text software on a HU's voice message simultaneously with a CA generating text from the voice message and, when an AU requests a “catch-up” or “re-sync” of the transcription process to the current voice message, server 30 may provide “fill in” automated text corresponding to the portion of the voice message between the most recent CA generated text and the instantaneous voice message which may be provided to the AU's device for display and also, optionally, to the CA's display screen to maintain context for the CA. In this case, while the fill in automated text may have some errors, the fill in text will be better than no text for the associated period and can be referred to by the AU to better understand the voice messages.

In cases where the fill in text is presented on the CA's display screen, the CA may correct any errors in the fill in text. This correction and any error correction by a CA for that matter may be made prior to transmitting text to the AU's device or subsequent thereto. Where corrected text is transmitted to an AU's device subsequent to transmission of the original error prone text, the AU's device corrects the errors by replacing the erroneous text with the corrected text.

Because it is often the case that AUs will request a re-sync only when they have difficulty understanding words, server 30 may only present automated fill in text to an AU corresponding to a pre-defined duration period (e.g., 8 seconds) that precedes the time when the re-sync request occurs. For instance, consistent with the example above where CA captioning falls behind by thirty seconds, an AU may only request re-sync at the end of the most recent five seconds as inability to understand the voice message may only be an issue during those five seconds. By presenting the most recent eight seconds of automated text to the AU, the user will have the chance to read text corresponding to the misunderstood voice message without being inundated with a large segment of automated text to view. Where automated fill in text is provided to an AU for only a pre-defined duration period, the same text may be provided for correction to the CA.
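As a rough sketch of limiting fill in text to a pre-defined period before the re-sync request, the following assumes that each automatically transcribed word carries the time at which it was spoken. The `TimedWord` type, the `select_fill_in` function and its parameters are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    spoken_at: float  # seconds since the start of the call (assumed)


def select_fill_in(asr_words, last_ca_time, resync_time, window=8.0):
    """Pick automated words for the gap, capped at `window` seconds.

    Only words spoken after the most recent CA generated text and within
    the `window` seconds preceding the re-sync request are returned.
    """
    start = max(last_ca_time, resync_time - window)
    return [w.text for w in asr_words if start < w.spoken_at <= resync_time]


# One automated word per second for a 30 second stretch of the call.
asr_words = [TimedWord(f"w{t}", float(t)) for t in range(1, 31)]
fill = select_fill_in(asr_words, last_ca_time=0.0, resync_time=30.0)
print(fill)  # words spoken during the final 8 seconds
```

Setting `window` to a value larger than the CA's lag yields the full gap between the CA generated text and the current voice message, which corresponds to the unrestricted fill in case described earlier.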

Referring now to FIG. 7, a method 190 by which an AU requests a re-sync of the transcription process to current voice messages when CA generated text falls behind current voice messages is illustrated. Referring also to FIG. 1, at block 192 a HU's voice messages are received at relay 16. After block 192, control passes down to each of blocks 194 and 200 where two simultaneous sub-processes occur in parallel. At block 194, the HU's voice messages are stored in a rolling buffer. The rolling buffer may, for instance, have a two minute duration so that the most recent two minutes of a HU's voice messages are always stored. At block 196, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc. At block 198 the CA generated text is transmitted to AU's device 12 to be presented on display screen 18 after which control passes back up to block 192. Text correction may occur at block 196 or after block 198.
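A rolling buffer of the kind used at block 194 might be sketched as below. The class name `RollingVoiceBuffer`, the chunk representation (a timestamp plus raw bytes) and the one chunk per second usage are assumptions for the example; the disclosure states only that the most recent two minutes of voice messages may be retained.

```python
from collections import deque


class RollingVoiceBuffer:
    """Keep only the most recent `duration_s` seconds of audio chunks."""

    def __init__(self, duration_s=120.0):
        self.duration_s = duration_s
        self._chunks = deque()  # (timestamp in seconds, audio bytes)

    def append(self, timestamp, audio):
        self._chunks.append((timestamp, audio))
        # Evict chunks that have aged out of the rolling window.
        while timestamp - self._chunks[0][0] >= self.duration_s:
            self._chunks.popleft()

    def since(self, timestamp):
        """Return audio chunks recorded at or after `timestamp`."""
        return [audio for t, audio in self._chunks if t >= timestamp]

    def __len__(self):
        return len(self._chunks)


buf = RollingVoiceBuffer(duration_s=120.0)
for second in range(300):          # five minutes of one second chunks
    buf.append(float(second), b"\x00")
print(len(buf))                    # only the last two minutes remain
```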

Referring again to FIG. 7, at process block 200, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 202. Although not shown in FIG. 7, after block 202, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 7, at decision block 204, controller 30 monitors for a catch up or re-sync command received via the AU's device 12 (e.g., via selection of an on-screen virtual “catch up” button 220, see again FIG. 17). Where no catch up or re-sync command has been received, control passes back up to block 192 where the process described above continues to cycle. At block 204, once a re-sync command has been received, control passes to block 206 where the buffered voice messages are skipped and a current voice message is presented to the ear of the CA to be transcribed. At block 208 the automated text corresponding to the skipped voice message segment is filled in to the text on the CA's screen for context and at block 210 the fill in text is transmitted to the AU's device for display.

Where automated text is filled in upon the occurrence of a catch up process, the fill in text may be visually distinguished on the AU's screen and/or on the CA's screen. For instance, fill in text may be highlighted, underlined, bolded, shown in a distinct font, etc. For example, see FIG. 18 that shows fill in text 222 that is underlined to visually distinguish it from other text. See also that the captioning delay 212 has been updated. In some cases, fill in text corresponding to voice messages that occur after or within some pre-defined period prior to a re-sync request may be distinguished in yet a third way to point out the text corresponding to the portion of a voice message that the AU most likely found interesting (e.g., the portion that prompted selection of the re-sync button). For instance, where 24 previous seconds of text are filled in when a re-sync request is initiated, all 24 seconds of fill in text may be underlined and the 8 seconds of text prior to the re-sync request may also be highlighted in yellow. See in FIG. 18 that some of the fill in text is shown in a phantom box 226 to indicate highlighting.
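The layered distinction in the example above (underline all fill in text, additionally highlight the last 8 seconds) could be sketched as follows. The tuple layout for words, the style names and the function `style_fill_in` are all hypothetical; an actual device would map such tags onto whatever rendering attributes its display supports.

```python
def style_fill_in(words, resync_time, highlight_window=8.0):
    """Tag each word with the display styles to apply.

    `words` is a list of (text, spoken_at_seconds, is_fill_in) tuples,
    an assumed representation chosen for this sketch.
    """
    styled = []
    for text, spoken_at, is_fill_in in words:
        styles = set()
        if is_fill_in:
            styles.add("underline")  # all fill in text is underlined
            if resync_time - highlight_window <= spoken_at <= resync_time:
                styles.add("highlight")  # the most recent portion
        styled.append((text, styles))
    return styled


words = [
    ("hello", 0.0, False),   # CA generated text, no special styling
    ("older", 10.0, True),   # fill in text outside the 8 second window
    ("recent", 20.0, True),  # fill in text inside the 8 second window
]
for text, styles in style_fill_in(words, resync_time=24.0):
    print(text, sorted(styles))
```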

In at least some cases it is contemplated that server 30 may be programmed to automatically determine when CA generated text substantially lags a current voice message from a HU and server 30 may automatically skip ahead to re-sync a CA with a current message while providing automated fill in text corresponding to intervening voice messages. For instance, server 30 may recognize when CA generated text is more than thirty seconds behind a current voice message and may skip the voice messages ahead to the current message while filling in automated text to fill the gap. In at least some cases this automated skip ahead process may only occur after at least some (e.g., 2 minutes) training to a HU's voice to ensure that minimal errors are generated in the fill in text.

A method 150 for automatically skipping to a current voice message in a buffer when a CA falls too far behind is shown in FIG. 6. Referring also to FIG. 1, at block 152, a HU's voice messages are received at relay 16. After block 152, control passes down to each of blocks 154 and 162 where two simultaneous sub-processes occur in parallel. At block 154, the HU's voice messages are stored in a rolling buffer. At block 156, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc., after which control passes to block 170.

Referring still to FIG. 6, at process block 162, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 164. Although not shown in FIG. 6, after block 164, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.

Referring still to FIGS. 1 and 6, at decision block 166, controller 30 monitors how far CA text transcription is behind the current voice message and compares that value to a threshold value. If the delay is less than the threshold value, control passes down to block 170. If the delay exceeds the threshold value, control passes to block 168 where server 30 uses automated text from block 164 to fill in the CA generated text and skips the CA up to the current voice message. After block 168 control passes to block 170. At block 170, the text including the CA generated text and the fill in text is presented to the CA via display screen 50 and the CA makes any corrections to observed errors. At block 172, the text is transmitted to AU's device 12 and is displayed on screen 18. Again, uncorrected text may be transmitted to and displayed on device 12 and corrected text may be subsequently transmitted and used to correct errors in the prior text in line on device 12. After block 172 control passes back up to block 152 where the process described above continues to cycle. Automatically generated text to fill in when skipping forward may be visually distinguished (e.g., highlighted, underlined, etc.).
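The threshold comparison at decision block 166 and the fill in at block 168 can be sketched roughly as follows. The function `auto_resync`, the representation of text as word lists, the timestamped `asr_words` pairs, and the choice to treat a delay exactly equal to the threshold as not exceeding it are assumptions made for this example only.

```python
def auto_resync(ca_text, asr_words, ca_position, now, threshold=30.0):
    """Fill in automated text and skip the CA ahead when the lag is too large.

    ca_text      words the CA has produced so far (list of str)
    asr_words    (word, spoken_at_seconds) pairs from the automated software
    ca_position  point in the voice stream the CA has transcribed up to
    now          time of the current voice message

    Returns (combined text, new CA position, fill in words).
    """
    delay = now - ca_position
    if delay <= threshold:
        # CA is close enough to real time; nothing is skipped.
        return ca_text, ca_position, []
    fill = [w for w, t in asr_words if ca_position < t <= now]
    return ca_text + fill, now, fill


asr = [("hi", 1.0), ("there", 20.0), ("friend", 40.0)]
text, position, fill = auto_resync(["hi"], asr, ca_position=1.0, now=45.0)
print(text, position, fill)
```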

In at least some cases when automated fill in text is generated, that text may not be presented to the CA or the AU as a single block and instead may be doled out at a higher speed than the talking speed of the HU until the text catches up with a current time. To this end, where transcription is far behind a current point in a conversation, if automated catch up text were generated as an immediate single block, in at least some cases, the earliest text in the block could shoot off a CA's display screen or an AU's display screen so that the CA or the AU would be unable to view all of the automated catch up text. Instead of presenting the automated text as a complete block upon catch up, the automated catch up text may be presented at a rate that is faster (e.g., two to three times faster) than the HU's rate of speaking so that catch up is rapid without the oldest catch up text running off the CA's or AU's displays.
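A simple way to picture this paced presentation is a schedule of display times for the catch up words. In the sketch below, the function `catch_up_schedule`, the assumed HU speaking rate of 2.5 words per second and the evenly spaced timing are illustrative assumptions; the disclosure specifies only that the presentation rate exceeds the HU's speaking rate (e.g., by a factor of two to three).

```python
def catch_up_schedule(words, hu_words_per_second=2.5, speedup=2.0):
    """Return (display offset in seconds, word) pairs for catch up text.

    Words are released at a constant rate `speedup` times faster than
    the assumed HU speaking rate instead of as one block.
    """
    interval = 1.0 / (hu_words_per_second * speedup)
    return [(i * interval, word) for i, word in enumerate(words)]


for offset, word in catch_up_schedule(["see", "you", "at", "noon"]):
    print(f"{offset:.1f}s {word}")
```

With the assumed defaults, 60 words of backlog (24 seconds of speech) would be presented in roughly 12 seconds.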

In addition to avoiding a case where text shoots off an AU's display screen, presenting text in a constant but rapid flow has a better feel to it as the text is not presented in a jerky start and stop fashion which can be distracting to an AU trying to follow along as text is presented.

In other cases, when an AU requests fill in, the system may automatically fill in text and only present the most recent 10 seconds or so of the automatic fill in text to the CA for correction so that the AU has corrected text corresponding to a most recent period as quickly as possible. In many cases where the CA generated text is substantially delayed, much of the fill in text would run off a typical AU's device display screen when presented so making corrections to that text would make little sense as the AU that requests catch up text is typically most interested in text associated with the most recent HU voice signal.

Many AU's devices can be used as conventional telephones without captioning service or as AU devices where captioning is presented and voice messages are broadcast to an AU. The idea here is that one device can be used by hearing impaired persons and persons that have no hearing impairment and that the overall costs associated with providing captioning service can be minimized by only using captioning when necessary. In many cases even a hearing impaired person may not need captioning service all of the time. For instance, a hearing impaired person may be able to hear the voice of a person that speaks loudly fairly well but may not be able to hear the voice of another person that speaks more softly. In this case, captioning would be required when speaking to the person with the soft voice but may not be required when speaking to the person with the loud voice. As another instance, an impaired person may hear better when well rested but hear relatively more poorly when tired so captioning is required only when the person is tired. As still another instance, an impaired person may hear well when there is minimal noise on a line but may hear poorly if line noise exceeds some threshold. Again, the impaired person would only need captioning some of the time.

To minimize captioning service costs and still enable an impaired person to obtain captioning service whenever needed and even during an ongoing call, some systems start out all calls with a default setting where an AU's device 12 is used like a normal telephone without captioning. At any time during an ongoing call, an AU can select either a mechanical or virtual “Caption” icon or button (see again 68 in FIG. 1) to link the call to a relay, provide a HU's voice messages to the relay and commence captioning service. One problem with starting captioning only after an AU experiences problems hearing words is that at least some words (e.g., words that prompted the AU to select the caption button in the first place) typically go unrecognized and therefore the AU is left with a void in their understanding of a conversation.

One solution to the problem of lost meaning when words are not understood just prior to selection of a caption button is to store a rolling recordation of a HU's voice messages that can be transcribed subsequently when the caption button is selected to generate “fill in” text. For instance, the most recent 20 seconds of a HU's voice messages may be recorded and then transcribed only if the caption button is selected. The relay generates text for the recorded message either automatically via software or via revoicing or typing by a CA or via a combination of both. In addition, the CA or the automated voice recognition software starts transcribing current voice messages. The text from the recording and the real time messages is transmitted to and presented via AU's device 12 which should enable the AU to determine the meaning of the previously misunderstood words. In at least some embodiments the rolling recordation of HU's voice messages may be maintained by the AU's device 12 (see again FIG. 1) and that recordation may be sent to the relay for immediate transcription upon selection of the caption button.

Referring now to FIG. 8, a process 230 that may be performed by the system of FIG. 1 to provide captioning for voice messages that occur prior to a request for captioning service is illustrated. Referring also to FIG. 1, at block 232 a HU's voice messages are received during a call with an AU at the AU's device 12. At block 234 the AU's device 12 stores a most recent 20 seconds of the HU's voice messages on a rolling basis. The 20 seconds of voice messages are stored without captioning initially in at least some embodiments. At decision block 236, the AU's device monitors for selection of a captioning button (not shown). If the captioning button has not been selected, control passes back up to block 232 where blocks 232, 234 and 236 continue to cycle.

Once the caption button has been selected, control passes to block 238 where AU's device 12 establishes a communication link to relay 16. At block 240 AU's device 12 transmits the stored 20 seconds of the HU's voice messages along with current ongoing voice messages from the HU to relay 16. At this point a CA and/or software at the relay transcribes the voice-to-text, corrections are made (or not), and the text is transmitted back to device 12 to be displayed. At block 242 AU's device 12 receives the captioned text from the relay 16 and at block 244 the received text is displayed or presented on the AU's device display 18. At block 246, in at least some embodiments, text corresponding to the 20 seconds of HU voice messages prior to selection of the caption button may be visually distinguished (e.g., highlighted, bolded, underlined, etc.) from other text in some fashion. After block 246 control passes back up to block 232 where the process described above continues to cycle and captioning in substantially real time continues.

Referring to FIG. 9, a relay server process 270 whereby automated software transcribes voice messages that occur prior to selection of a caption button and a CA at least initially captions current voice messages is illustrated. At block 272, after an AU requests captioning service by selecting a caption button, server 30 receives a HU's voice messages including current ongoing messages as well as the most recent 20 seconds of voice messages that had been stored by AU's device 12 (see again FIG. 1). After block 272, control passes to each of blocks 274 and 278 where two simultaneous processes commence in parallel. At block 274 the stored 20 seconds of voice messages are provided to voice-to-text software run by server 30 to generate automated text and at block 276 the automated text is transmitted to the AU's device 12 for display. At block 278 the current or real time HU's voice messages are provided to a CA and at block 280 the CA transcribes the current voice messages to text. The CA generated text is transmitted to an AU's device at block 282 where the text is displayed along with the text transmitted at block 276. Thus, here, the AU receives text corresponding to misunderstood voice messages that occur just prior to the AU requesting captioning. One other advantage of this system is that when captioning starts, the CA is not starting captioning with an already existing backlog of words to transcribe and instead automated software is used to provide the prior text.

In other embodiments, when an AU cannot understand a voice message during a normal call and selects a caption button to obtain captioning for a most recent segment of a HU's voice signal, the system may simply provide captions for the most recent 10-20 seconds of the voice signal without initiating ongoing automated captioning or assistance from a CA. Thus, where an AU is only sporadically or periodically unable to hear and understand the broadcast HU's voice, the AU may select the caption button to obtain periodic captioning when needed. For instance, it is envisioned that in one case, an AU may participate in a five minute call and may only require captioning during three short 20 second periods. In this case, the AU would select the caption button three times, once for each time that the user is unable to hear the HU's voice signal, and the system would generate three bursts of text, one for each of three HU voice segments just prior to each of the button activation events.

In some cases instead of just presenting captioning for the 20 seconds prior to a caption button activation event, the system may present the prior 20 seconds and a few seconds (e.g., 10) of captioning just after the button selection to provide the 20 prior seconds in some context to make it easier for the AU to understand the overall text.

Third Party Automated Speech Recognition (ASR) and Other ASR Resources

In addition to using a service provided by relay 16 to transcribe the stored rolling voice messages, other resources may be used to transcribe the stored rolling voice messages. For instance, in at least some embodiments an AU's device may link via the Internet or the like to a third party provider running automated speech recognition (ASR) software that can receive voice messages and transcribe those messages, at least somewhat accurately, to text. In these cases it is contemplated that real time transcription where accuracy needs to meet a high accuracy standard would still be performed by a CA or software trained to a specific voice while less accuracy sensitive text may be generated by the third party provider, at least some of the time for free or for a nominal fee, and transmitted back to the AU's device for display.

In other cases, it is contemplated that the AU's device 12 itself may run voice-to-text or ASR software to at least somewhat accurately transcribe voice messages to text where the text generated by the AU's device would only be provided in cases where accuracy sensitivity is less than normal such as where rolling voice messages prior to selection of a caption icon to initiate captioning are to be transcribed.

FIG. 10 shows another method 300 for providing text for voice messages that occurred prior to a caption request, albeit where an AU's device generates the pre-request text as opposed to a relay. Referring also to FIG. 1, at block 310 a HU's voice messages are received at an AU's device 12. At block 312, the AU's device 12 runs voice-to-text software that, in at least some embodiments, trains on the fly to the voice of a linked HU and generates caption text.

Here, on the fly training may include assigning a confidence factor to each automatically transcribed word and only using text that has a high confidence factor to train a voice model for the HU. For instance, only text having a confidence factor greater than 95% may be used for automatic training purposes. Here, confidence factors may be assigned based on many different factors or algorithms, many of which are well known in the automatic voice recognition art. In this embodiment, at least initially, the caption text generated by the AU's device 12 is not displayed to the AU. At block 314, until the AU requests captioning, control simply routes back up to block 310. Once captioning is requested by an AU, control passes to block 316 where the text corresponding to the last 20 seconds generated by the AU's device is presented on the AU's device display 18. Here, while there may be some errors in the displayed text, at least some text associated with the most recent voice message can be quickly presented and give the AU the opportunity to attempt to understand the voice messages associated therewith. At block 318 the AU's device links to a relay and at block 320 the HU's ongoing voice messages are transmitted to the relay. At block 322, after CA transcription at the relay, the AU's device receives the transcribed text from the relay and at block 324 the text is displayed. After block 324 control passes back up to block 320 where the sub-loop including blocks 320, 322 and 324 continues to cycle.
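The confidence based selection of training text may be illustrated with the short sketch below. The tuple layout (word, audio segment, confidence), the function `training_samples` and the sample values are hypothetical; how confidence factors are computed and how a voice model consumes the selected samples are outside the scope of the sketch.

```python
def training_samples(asr_words, min_confidence=0.95):
    """Keep only words whose confidence factor exceeds the threshold.

    `asr_words` is an assumed list of (text, audio_segment, confidence)
    tuples produced by some voice-to-text engine.
    """
    return [
        (text, audio)
        for text, audio, confidence in asr_words
        if confidence > min_confidence
    ]


recognized = [
    ("meet", b"\x01", 0.99),
    ("me", b"\x02", 0.97),
    ("at", b"\x03", 0.95),    # not greater than 95%, so excluded
    ("noon", b"\x04", 0.62),  # low confidence, excluded
]
print([text for text, _ in training_samples(recognized)])
```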

Thus, in the above example, instead of the AU's device storing the last 20 seconds of a HU's voice signal and transcribing that voice signal to text after the AU requests transcription, the AU's device constantly runs an ASR engine behind the scenes to generate automated engine text which is stored without initially being presented to the AU. Then, when the AU requests captioning or transcription, the most recently transcribed text can be presented via the AU's device display immediately or via rapid presentation (e.g., sequentially at a speed higher than the HU's speaking speed).

In at least some cases it is contemplated that voice-to-text software run outside control of the relay may be used to generate at least initial text for a HU's voice and that the initial text may be presented via an AU's device. Here, because known software still may generate more text transcription errors than allowed given standard accuracy requirements in the text captioning industry, a relay correction service may be provided. For instance, in addition to presenting text transcribed by the AU's device via a device display 18, the text transcribed by the AU's device may also be transmitted to a relay 16 for correction. In addition to transmitting the text to the relay, the HU's voice messages may also be transmitted to the relay so that a CA can compare the text automatically generated by the AU's device to the HU's voice messages. At the relay, the CA can listen to the voice of the hearing person and can observe associated text. Any errors in the text can be corrected and corrected text blocks can be transmitted back to the AU's device and used for in line correction on the AU's display screen.

One advantage to this type of system is that relatively less skilled CAs may be retained at a lesser cost to perform the CA tasks. A related advantage is that the stress level on CAs may be reduced appreciably by eliminating the need to both transcribe and correct at high speeds and therefore CA turnover at relays may be appreciably reduced which ultimately reduces costs associated with providing relay services.

A similar system may include an AU's device that links to some other third party provider ASR transcription/caption server (e.g., in the “cloud”) to obtain initial captioned text which is immediately displayed to an AU and which is also transmitted to the relay for CA correction. Here, again, the CA corrections may be used by the third party provider to train the software on the fly to the HU's voice. In this case, the AU's device may have three separate links, one to the HU, a second link to a third party provider server, and a third link to the relay. In other cases, the relay may create the link to the third party server for ASR services. Here, the relay would provide the HU's voice signal to the third party server, would receive text back from the server to transmit to the AU device and would receive corrections from the CA to transmit to each of the AU device and the third party server. The third party server would then use the corrections to train the voice model to the HU voice and would use the evolving model to continue ASR transcription. In still other cases the third party ASR may train on an HU's voice signal based on confidence factors and other training algorithms and completely independent of CA corrections.

Referring to FIG. 11, a method 360 whereby an AU's device transcribes a HU's voice to text and where corrections are made to the text at a relay is illustrated. At block 362 a HU's voice messages are received at an AU's device 12 (see also again FIG. 1). At block 364 the AU's device runs voice-to-text software to generate text from the received voice messages and at block 366 the generated text is presented to the AU via display 18. At block 370 the transcribed text is transmitted to the relay 16 and at block 372 the text is presented to a CA via the CA's display 50. At block 374 the CA corrects the text and at block 376 corrected blocks of text are transmitted to the AU's device 12. At block 378 the AU's device 12 uses the corrected blocks to correct the text errors via in line correction. At block 380, the AU's device uses the errors, the corrected text and the voice messages to train the captioning software to the HU's voice.
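The in line correction at block 378 can be pictured as replacing previously displayed text blocks with corrected versions received from the relay. The sketch below assumes that each block of automated text carries an identifier that the relay echoes back with its correction; the `CaptionDisplay` class and the identifier scheme are hypothetical and are not described in the disclosure.

```python
class CaptionDisplay:
    """Minimal model of caption text shown on an AU's device display."""

    def __init__(self):
        self.blocks = {}  # block id -> text, kept in display order

    def show(self, block_id, text):
        """Present a newly transcribed block of automated text."""
        self.blocks[block_id] = text

    def correct(self, block_id, corrected_text):
        """Replace an already displayed block with its corrected text."""
        if block_id in self.blocks:
            self.blocks[block_id] = corrected_text

    def render(self):
        return " ".join(self.blocks.values())


display = CaptionDisplay()
display.show(1, "I will meat you")    # automated text with an error
display.show(2, "at noon")
display.correct(1, "I will meet you")  # corrected block from the relay
print(display.render())
```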

In some cases instead of having a relay or an AU's device run automated voice-to-text transcription software, a HU's device may include a processor that runs transcription software to generate text corresponding to the HU's voice messages. To this end, device 14 may, instead of including a simple telephone, include a computer that can run various applications including a voice-to-text program or may link to some third party real time transcription software program (e.g., software run on a third party server in the “cloud” (e.g., Watson, Google Voice, etc.)) to obtain an initial text transcription substantially in real time. Here, as in the case where an AU's device runs the transcription software, the text will often have more errors than allowed by the standard accuracy requirements.

Again, to correct the errors, the text and the HU's voice messages are transmitted to relay 16 where a CA listens to the voice messages, observes the text on the CA's display 50 and makes corrections to eliminate transcription errors. The corrected blocks of text are transmitted to the AU's device for display. The corrected blocks may also be transmitted back to the HU's device for training the captioning software to the HU's voice. In these cases the text transcribed by the HU's device and the HU's voice messages may either be transmitted directly from the HU's device to the relay or may be transmitted to the AU's device 12 and then on to the relay. Where the HU's voice messages and text are transmitted directly to the relay 16, the voice messages and text may also be transmitted directly to the AU's device for immediate broadcast and display and the corrected text blocks may be subsequently used for in line correction.

In these cases the caption request option may be supported so that an AU can initiate captioning during an on-going call at any time by simply transmitting a signal to the HU's device instructing the HU's device to start the captioning process. Similarly, in these cases the help request option may be supported. Where the help option is facilitated, the automated text may be presented via the AU's device and, if the AU perceives that too many text errors are being generated, the help button may be selected to cause the HU's device or the AU's device to transmit the automated text to the relay for CA correction.

One advantage to having a HU's device manage or perform voice-to-text transcription is that the voice signal being transcribed can be a relatively high quality voice signal. To this end, a standard phone voice signal has a range of frequencies between 300 and about 3000 Hertz which is only a fraction of the frequency range used by most voice-to-text transcription programs and therefore, in many cases, automated transcription software does only a poor job of transcribing voice signals that have passed through a telephone connection. Where transcription can occur within a digital signal portion of an overall system, the frequency range of voice messages can be optimized for automated transcription. Thus, where a HU's computer that is all digital receives and transcribes voice messages, the frequency range of the messages is relatively large and accuracy can be increased appreciably. Similarly, where a HU's computer can send digital voice messages to a third party transcription server, accuracy can be increased appreciably.

Calls of Different Sound Quality Handled Differently

In at least some configurations it is contemplated that the link between an AU's device 12 and a HU's device 14 may be either a standard phone type connection or may be a digital or high definition (HD) connection depending on the capabilities of the HU's device that links to the AU's device. Thus, for instance, a first call may be standard quality and a second call may be high definition audio. Because high definition voice messages have a greater frequency range and therefore can be automatically transcribed more accurately than standard definition audio voice messages in many cases, it has been recognized that a system where automated voice-to-text program use is implemented on a case by case basis depending upon the type of voice message received (e.g., digital or analog) would be advantageous. For instance, in at least some embodiments, where a relay receives a standard definition voice message for transcription, the relay may automatically link to a CA for full CA transcription service where the CA transcribes and corrects text via revoicing and keyboard manipulation and, where the relay receives a high definition digital voice message for transcription, the relay may run an automated voice-to-text transcription program to generate automated text. The automated text may either be immediately corrected by a CA or may only be corrected by an assistant after a help feature is selected by an AU as described above.

Referring to FIG. 12, one process 400 for treating high definition digital messages differently than standard definition voice messages is illustrated. Referring also to FIG. 1, at block 402 a HU's voice messages are received at a relay 16. At decision block 404, relay server 30 determines if the received voice message is a high definition digital message or is a standard definition message (e.g., sometimes an analog message). Where a high definition message has been received, control passes to block 406 where server 30 runs an automated voice-to-text program on the voice messages to generate automated text. At block 408 the automated text is transmitted to the AU's device 12 for display. Referring again to block 404, where the HU's voice messages are in standard definition audio, control passes to block 412 where a link to a CA is established so that the HU's voice messages are provided to a CA. At block 414 the CA listens to the voice messages and transcribes the messages into text. Error correction may also be performed at block 414. After block 414, control passes to block 408 where the CA generated text is transmitted to the AU's device 12. Again, in some cases, when automated text is presented to an AU, a help button may be presented that, when selected, causes automated text to be presented to a CA for correction. In other cases automated text may be automatically presented to a CA for correction.

Another system is contemplated where all incoming calls to a relay are initially assigned to a CA for at least initial captioning where the option to switch to automated software generated text is only available when the call includes high definition audio and after accuracy standards have been exceeded. Here, all standard definition HU voice messages would be captioned by a CA from start to finish and any high definition calls would cut out the CA when the standard is exceeded.

In at least some cases where an AU's device is capable of running automated voice-to-text transcription software, the AU's device 12 may be programmed to select either automated transcription when a high definition digital voice message is received or a relay with a CA when a standard definition voice message is received. Again, where device 12 runs an automated text program, CA correction may be automatic or may only start when a help button is selected.

FIG. 13 shows a process 430 whereby an AU's device 12 selects either automated voice-to-text software or a CA to transcribe based on the type (e.g., digital or analog) of voice messages received. At block 432 a HU's voice messages are received by an AU's device 12. At decision block 434, a processor in device 12 determines if the AU has selected a help button. Initially no help button is selected as no text has been presented so at least initially control passes to block 436. At decision block 436, the device processor determines if a HU's voice signal that is received is high definition digital or is standard definition. Where the received signal is high definition digital, control passes to block 438 where the AU's device processor runs automated voice-to-text software to generate automated text which is then displayed on the AU device display 18 at block 440.

Referring still to FIG. 13, if the help button has been selected at block 434 or if the received voice messages are in standard definition, control passes to block 442 where a link to a CA at relay 16 is established and the HU's voice messages are transmitted to the relay. At block 444 the CA listens to the voice messages and generates text and at block 446 the text is transmitted to the AU's device 12 where the text is displayed at block 440.

HU Recognition and Voice Training

It has been recognized that in many cases most calls facilitated using an AU's device will be with a small group of other hearing or non-HUs. For instance, in many cases as much as 70 to 80 percent of all calls to an AU's device will be with one of five or fewer HU's devices (e.g., family, close friends, a primary care physician, etc.). For this reason it has been recognized that it would be useful to store voice-to-text models for at least routine callers that link to an AU's device so that the automated voice-to-text training process can either be eliminated or substantially expedited. For instance, when an AU initiates a captioning service, if a previously developed voice model for a HU can be identified quickly, that model can be used without a new training process and the switchover from a full service CA to automated captioning may be expedited (e.g., instead of taking a minute or more the switchover may be accomplished in 15 seconds or less, in the time required to recognize or distinguish the HU's voice from other voices).

FIG. 14 shows a sub-process 460 that may be substituted for a portion of the process shown in FIG. 3 wherein voice-to-text templates or models along with related voice recognition profiles for callers are stored and used to expedite the handoff to automated transcription. Prior to running sub-process 460, referring again to FIG. 1, server 30 is used to create a voice recognition database for storing HU device identifiers along with associated voice recognition profiles and associated voice-to-text models. A voice recognition profile is a data construct that can be used to distinguish one voice from others and provide improved speech to text accuracy.

In the context of the FIG. 1 system, voice recognition profiles are useful because more than one person may use a HU's device to call an AU. For instance in an exemplary case, an AU's son or daughter-in-law or one of any of three grandchildren may routinely use device 14 to call an AU and therefore, to access the correct voice-to-text model, server 30 needs to distinguish which caller's voice is being received. Thus, in many cases, the voice recognition database will include several voice recognition profiles for each HU device identifier (e.g., each HU phone number). A voice-to-text model includes parameters that are used to customize voice-to-text software for transcribing the voice of an associated HU to text.

The voice recognition database will include at least one voice model for each voice profile to be used by server 30 to automate transcription whenever a voice associated with the specific profile is identified. Data in the voice recognition database will be generated on the fly as an AU uses device 12. Thus, initially the voice recognition database will include a simple construct with no device identifiers, profiles or voice models.

Referring still to FIGS. 1 and 14 and now also to FIG. 3, at decision block 84 in FIG. 3, if the help flag is still zero (e.g., an AU has not requested CA help to correct automated text errors) control may pass to block 464 in FIG. 14 where the HU's device identifier (e.g., a phone number, an IP address, a serial number of a HU's device, etc.) is received by server 30. At block 468 server 30 determines if the HU's device identifier has already been added to the voice recognition database. If the HU's device identifier does not appear in the database (e.g., the first time the HU's device is used to connect to the AU's device) control passes to block 482 where server 30 uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 476. At block 476 the server 30 trains a voice-to-text model using transcription errors. Again, the training will include comparing CA generated text to automated text to identify errors and using the errors to adjust model parameters so that the next time a word associated with an error is uttered by the HU, the software will identify the correct word. At block 478, server 30 trains a voice profile for the HU's voice so that the next time the HU calls, a voice profile will exist for the specific HU that can be used to identify the HU. At block 480 the server 30 stores the voice profile and voice model for the HU along with the HU device identifier for future use after which control passes back up to block 94 in FIG. 3.

Referring still to FIGS. 1 and 14, at block 468, if the HU's device is already represented in the voice recognition database, control passes to block 470 where server 30 runs voice recognition software on the HU's voice messages in an attempt to identify a voice profile associated with the specific HU. At decision block 472, if the HU's voice does not match one of the previously stored voice profiles associated with the device identifier, control passes to block 482 where the process described above continues. At block 472, if the HU's voice matches a previously stored profile, control passes to block 474 where the voice model associated with the matching profile is used to tune the voice-to-text software to be used to generate automated text.

Referring still to FIG. 14, at blocks 476 and 478, the voice model and voice profile for the HU are continually trained. Continual training enables the system to constantly adjust the model for changes in a HU's voice that may occur over time or when the HU experiences some physical condition (e.g., a cold, a raspy voice) that affects the sound of their voice. At block 480, the voice profile and voice model are stored with the HU device identifier for future use.
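One way to picture the voice recognition database used in sub-process 460 is as a mapping from each HU device identifier to a list of profile and model pairs. The sketch below is an assumption-laden illustration: VoiceEntry, VoiceRecognitionDatabase, find_model and the caller-supplied matches function are hypothetical names, and real profiles and models would be far richer than the strings and dictionaries used here as stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceEntry:
    profile: str  # stand-in for a voice recognition profile
    model: dict   # stand-in for voice-to-text model parameters


@dataclass
class VoiceRecognitionDatabase:
    # Maps a HU device identifier (e.g., a phone number) to its entries.
    # Starts empty, consistent with data being generated on the fly.
    entries: Dict[str, List[VoiceEntry]] = field(default_factory=dict)

    def find_model(
        self,
        device_id: str,
        voice: str,
        matches: Callable[[str, str], bool],
    ) -> Optional[dict]:
        """Return the model for a recognized voice, or None for the general path."""
        profiles = self.entries.get(device_id)
        if profiles is None:
            return None  # block 468: device not yet in the database
        for entry in profiles:  # blocks 470 and 472: try each stored profile
            if matches(entry.profile, voice):
                return entry.model  # block 474: use the matching model
        return None  # no profile matched, fall back to block 482

    def store(self, device_id: str, entry: VoiceEntry) -> None:
        """Block 480: keep the profile and model with the device identifier."""
        self.entries.setdefault(device_id, []).append(entry)
```

A None result corresponds to the branch where a general voice-to-text program is used and a new model and profile are trained.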

In at least some embodiments, server 30 may adaptively change the order of voice profiles applied to a HU's voice during the voice recognition process. For instance, while server 30 may store five different voice profiles for five different HUs that routinely connect to an AU's device, a first of the profiles may be used 80 percent of the time. In this case, when captioning is commenced, server 30 may start by using the first profile to analyze a HU's voice at block 472 and may cycle through the profiles from the most matched to the least matched.
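The adaptive ordering described above amounts to sorting stored profiles by how often each has matched in the past. A minimal sketch follows, assuming a hypothetical match_counts tally that the server would increment each time a profile is matched; the function name ordered_profiles is also illustrative.

```python
from collections import Counter
from typing import List


def ordered_profiles(profile_ids: List[str], match_counts: Counter) -> List[str]:
    """Return profile ids from most matched to least matched.

    Profiles with no recorded matches count as zero, and ties keep their
    original relative order because Python's sort is stable.
    """
    return sorted(profile_ids, key=lambda pid: match_counts[pid], reverse=True)
```

Trying the most frequently matched profile first means that, in the common case, recognition can stop after a single comparison.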

To avoid server 30 having to store a different voice profile and voice model for every hearing person that communicates with an AU via device 12, in at least some embodiments it is contemplated that server 30 may only store models and profiles for a limited number (e.g., 5) of frequent callers. To this end, in at least some cases server 30 will track calls and automatically identify the most frequent HU devices used to link to the AU's device 12 over some rolling period (e.g., 1 month) and may only store models and profiles for the most frequent callers. Here, a separate counter may be maintained for each HU device used to link to the AU's device over the rolling period and different models and profiles may be swapped in and out of the stored set based on frequency of calls.
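The per-device counters over a rolling period can be sketched as follows. This is an illustration under stated assumptions: the call log format (pairs of device identifier and call time), the 30 day window standing in for "1 month", and the function name most_frequent_devices are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def most_frequent_devices(
    call_log: Iterable[Tuple[str, datetime]],
    now: datetime,
    window: timedelta = timedelta(days=30),
    limit: int = 5,
) -> List[str]:
    """Return up to `limit` device ids with the most calls inside the window."""
    cutoff = now - window
    # One counter per HU device, counting only calls within the rolling period.
    counts = Counter(device for device, when in call_log if when >= cutoff)
    return [device for device, _ in counts.most_common(limit)]
```

Models and profiles for devices that drop out of the returned list would be candidates for removal from the stored set, and newly listed devices would be candidates for storage.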

In other embodiments server 30 may query an AU for some indication that a specific HU is or will be a frequent contact and may add that person to a list for which a model and a profile should be stored for a total of up to five persons.

While the system described above with respect to FIG. 14 assumes that the relay 16 stores and uses voice models and voice profiles that are trained to HU's voices for subsequent use, in at least some embodiments it is contemplated that an AU's device 12 processor may maintain and use or at least have access to and use the voice recognition database to generate automated text without linking to a relay. In this case, because the AU's device runs the software to generate the automated text, the software for generating text can be trained any time the user's device receives a HU's voice messages without linking to a relay. For example, during a call between a HU and an AU on devices 14 and 12, respectively, in FIG. 1, and prior to an AU requesting captioning service, the voice messages of even a new HU can be used by the AU's device to train a voice-to-text model and a voice profile for the user. In addition, prior to a caption request, as the model is trained and gets better and better, the model can be used to generate text that can be used as fill in text (e.g., text corresponding to voice messages that precede initiation of the captioning function) when captioning is selected.

FIG. 15 shows a process 500 that may be performed by an AU's device to train voice models and voice profiles and use those models and profiles to automate text transcription until a help button is selected. Referring also to FIG. 1, at block 502, an AU's device 12 processor receives a HU's voice messages as well as an identifier (e.g., a phone number) of the HU's device 14. At block 504 the processor determines if the AU has selected the help button (e.g., indicating that current captioning includes too many errors). If an AU selects the help button at block 504, control passes to block 522 where the AU's device is linked to a CA at relay 16 and the HU's voice is presented to the CA. At block 524 the AU's device receives text back from the relay and at block 534 the CA generated text is displayed on the AU's device display 18.

Where the help button has not been selected, control passes to block 505 where the processor uses the device identifier to determine if the HU's device is represented in the voice recognition database. Where the HU's device is not represented in the database control passes to block 528 where the processor uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 512.

Referring again to FIGS. 1 and 15, at block 512 the processor adaptively trains the voice model using perceived errors in the automated text. To this end, one way to train the voice model is to generate text phonetically and thereafter perform a context analysis of each text word by looking at other words proximate the word to identify errors. Another example of using context to identify errors is to look at several generated text words as a phrase and compare the phrase to similar prior phrases that are consistent with how the specific HU strings words together and identify any discrepancies as possible errors. At block 514 a voice profile for the HU is generated from the HU's voice messages so that the HU's voice can be recognized in the future. At block 516 the voice model and voice profile for the HU are stored for future use during subsequent calls and then control passes to block 518 where the process described above continues. Thus, blocks 528, 512, 514 and 516 enable the AU's device to train voice models and voice profiles for HUs that call in anew where a new voice model can be used during an ongoing call and during future calls to provide generally accurate transcription.

Referring still to FIGS. 1 and 15, if the HU's device is already represented in the voice recognition database at block 505, control passes to block 506 where the processor runs voice recognition software on the HU's voice messages in an attempt to identify one of the voice profiles associated with the device identifier. At block 508, where no voice profile is recognized, control passes to block 528.

At block 508, if the HU's voice matches one of the stored voice profiles, control passes to block 510 where the voice-to-text model associated with the matching profile is used to generate automated text from the HU's voice messages. Next, at block 518, the AU's device processor determines if the caption button on the AU's device has been selected. If captioning has not been selected control passes to block 502 where the process continues to cycle. Once captioning has been requested, control passes to block 520 where AU's device 12 displays the most recent 10 seconds of automated text and continuing automated text on display 18.

In at least some embodiments it is contemplated that different types of voice model training may be performed by different processors within the overall FIG. 1 system. For instance, while an AU's device is not linked to a relay, the AU's device cannot use any errors identified by a call assistant at the relay to train a voice model as no CA is identifying errors. Nevertheless, the AU's device can use context and confidence factors to identify errors and train a model. Once an AU's device is linked to a relay where a CA corrects errors, the relay server can use the CA identified errors and corrections to train a voice model which can, once sufficiently accurate, be transmitted to the AU's device where the new model is substituted for the old context based model or where the two models are combined into a single robust model in some fashion. In other cases when an AU's device links to a relay for CA captioning, a context based voice model generated by the AU's device for the HU may be transmitted to the relay server and used as an initial model to be further trained using CA identified errors and corrections. In still other cases CA identified errors may be provided to the AU's device and used by that device to further train a context based voice model for the HU.

Referring now to FIG. 16, a sub-process 550 that may be added to the process shown in FIG. 15 whereby an AU's device trains a voice model for a HU using voice message content and a relay server further trains the voice model generated by the AU's device using CA identified errors is illustrated. Referring also to FIG. 15, sub-process 550 is intended to be performed in parallel with blocks 524 and 534 in FIG. 15. Thus, after block 522, in addition to block 524, control also passes to block 552 in FIG. 16. At block 552 the voice model for a HU that has been generated by an AU's device 12 is transmitted to relay 16 and at block 553 the voice model is used to modify a voice-to-text program at the relay. At block 554 the modified voice-to-text program is used to convert the HU's voice messages to automated text. At block 556 the CA generated text is compared to the automated text to identify errors. At block 558 the errors are used to further train the voice model. At block 560, if the voice model has an accuracy below the required standard, control passes back to block 502 in FIG. 15 where the process described above continues to cycle. At block 560, once the accuracy exceeds the standard requirement, control passes to block 562 wherein server 30 transmits the trained voice model to the AU's device for handling subsequent calls from the HU for which the model was trained. At block 564 the new model is stored in the database maintained by the AU's device.

Referring still to FIG. 16, in addition to transmitting the trained model to the AU's device at block 562, once the model is accurate enough to meet the standard requirements, server 30 may perform an automated process to cut out the CA and instead transmit automated text to the AU's device as described above in FIG. 1. In the alternative, once the model has been transmitted to the AU's device at block 562, the relay may be programmed to hand off control to the AU's device which would then use the newly trained and relatively more accurate model to perform automated transcription so that the relay could be disconnected.

Several different concepts and aspects of the present disclosure have been described above. It should be understood that many of the concepts and aspects may be combined in different ways to configure other triage systems that are more complex. For instance, one exemplary system may include an AU's device that attempts automated captioning with on the fly training first and, when automated captioning by the AU's device fails (e.g., a help icon is selected by an AU), the AU's device may link to a third party captioning system via the internet or the like where another more sophisticated voice-to-text captioning software is applied to generate automated text. Here, if the help button is selected a second time or a “CA” button is selected, the AU's device may link to a CA at the relay for CA captioning with simultaneous voice-to-text software transcription where errors in the automated text are used to train the software until a threshold accuracy requirement is met. Here, once the accuracy requirement is exceeded, the system may automatically cut out the CA and switch to the automated text from the relay until the help button is again selected. In each of the transcription hand offs, any learning or model training performed by one of the processors in the system may be provided to the next processor in the system to be used to expedite the training process.

Line Check Words

In at least some embodiments an automated voice-to-text engine may be utilized in other ways to further enhance calls handled by a relay. For instance, in cases where transcription by a CA lags behind a HU's voice messages, automated transcription software may be programmed to transcribe text all the time and identify specific words in a HU's voice messages to be presented via an AU's display immediately when identified to help the AU determine when a HU is confused by a communication delay. For instance, assume that transcription by a CA lags a HU's most current voice message by 20 seconds and that an AU is relying on the CA generated text to communicate with the HU. In this case, because the CA generated text lag is substantial, the HU may be confused when the AU's response also lags a similar period and may generate a voice message questioning the status of the call. For instance, the HU may utter “Are you there?” or “Did you hear me?” or “Hello” or “What did you say?”. These phrases and others like them querying call status are referred to herein as “line check words” (LCWs) as the HU is checking the status of the call on the line.

If the line check words were not presented until they occurred sequentially in the HU's voice messages, they would be delayed for 20 or more seconds in the above example. In at least some embodiments it is contemplated that the automated voice engine may search for line check words (e.g., 50 common line check phrases) in a HU's voice messages and present the line check words immediately via the AU's device during a call regardless of which words have been transcribed and presented to an AU. The AU, seeing line check words or a phrase, can verbally respond that the captioning service is lagging but catching up so that the parties can avoid or at least minimize confusion. In the alternative, a system processor may automatically respond to any line check words by broadcasting a voice message to the HU indicating that transcription is lagging and will catch up shortly. The automated message may also be broadcast to the AU so that the AU is also aware of the HU's situation.
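Searching automated text for line check words can be illustrated with a simple word-level phrase match. The sketch below is hypothetical in its details: the phrase list contains only the four examples quoted above rather than a full set, and the function names normalize and find_line_check_phrases are illustrative. It operates on text already produced by a voice engine and does not model the engine itself.

```python
import re
from typing import List, Sequence

# Example phrases taken from the description; a deployed list would be longer.
LINE_CHECK_PHRASES = (
    "are you there",
    "did you hear me",
    "hello",
    "what did you say",
)


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def find_line_check_phrases(
    automated_text: str,
    phrases: Sequence[str] = LINE_CHECK_PHRASES,
) -> List[str]:
    """Return each line check phrase that occurs as whole words in the text."""
    words = normalize(automated_text)
    found = []
    for phrase in phrases:
        target = phrase.split()
        n = len(target)
        # Slide a window of n words across the text and compare word lists.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            found.append(phrase)
    return found
```

Matching on whole words rather than substrings avoids flagging a word such as "Othello" as the line check word "hello". Any phrase returned could be presented to the AU immediately, ahead of the lagging CA generated text.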

When line check words are presented to an AU the words may be presented in-line within text being generated by a CA with intermediate blanks representing words yet to be transcribed by the CA. To this end, see again FIG. 17 that shows line check words “Are you still there?” in a highlighting box 590 at the end of intermediate blanks 216 representing words yet to be transcribed by the CA. Line check words will, in at least some embodiments, be highlighted on the display or otherwise visually distinguished. In other embodiments the line check words may be located at some prominent location on the AU's display screen (e.g., in a line check box or field at the top or bottom of the display screen).

One advantage of using an automated voice engine to only search for specific words and phrases is that the engine can be tuned for those words and will be relatively more accurate than a general purpose engine that transcribes all words uttered by a HU. In at least some embodiments the automated voice engine will be run by an AU's device processor while in other embodiments the automated voice engine may be run by the relay server with the line check words transmitted to the AU's device immediately upon generation and identification.

In still other cases where automated text is presented immediately upon generation to an AU, line check words may be presented in a visually distinguished fashion (e.g., highlighted, in different color, as a distinct font, as a uniquely sized font, etc.) so that an AU can distinguish those words from others and, where appropriate, provide a clarifying remark to a confused HU.

Referring now to FIG. 19, a process 600 that may be performed by an AU's device 12 and a relay to transcribe HU's voice messages and provide line check words immediately to an AU when transcription by a CA lags is illustrated. At block 602 a HU's voice messages are received by an AU's device 12. After block 602 control continues along parallel sub-processes to blocks 604 and 612. At block 604 the AU's device processor uses an automated voice engine to transcribe the HU's voice messages to text. Here, it is assumed that the voice engine may generate several errors and therefore likely would be insufficient for the purposes of providing captioning to the AU. The engine, however, is optimized and trained to caption a set (e.g., 10 to 100) of line check words and/or phrases which the engine can do extremely accurately. At block 606, the AU's device processor searches for line check words in the automated text. At block 608, if a line check word or phrase is not identified control passes back up to block 602 where the process continues to cycle. At block 608, if a line check word or phrase is identified, control passes to block 610 where the line check word/phrase is immediately presented (see phrase “Are you still there?” in FIG. 18) to the AU via display 18 either in-line or in a special location and, in at least some cases, in a visually distinct manner.

Referring still to FIG. 19, at block 612 the HU's voice messages are sent to a relay for transcription. At block 614, transcribed text is received at the AU's device back from the relay. At block 616 the text from the relay is used to fill in the intermediate blanks (see again FIG. 17 and also FIG. 18 where text has been filled in) on the AU's display.

ASR Suggests Errors in CA Generated Text

In at least some embodiments it is contemplated that an automated voice-to-text engine may operate all the time and may check for and indicate any potential errors in CA generated text so that the CA can determine if the errors should be corrected. For instance, in at least some cases, the automated voice engine may highlight potential errors in CA generated text on the CA's display screen inviting the CA to contemplate correcting the potential errors. In these cases the CA would have the final say regarding whether or not a potential error should be altered.

Consistent with the above comments, see FIG. 20 that shows a screen shot of a CA's display screen where potential errors have been highlighted to distinguish the errors from other text. Exemplary CA generated text is shown at 650 with errors shown in phantom boxes 652, 654 and 656 that represent highlighting. In the illustrated example, exemplary words generated by an automated voice-to-text engine are also presented to the CA in hovering fields above the potentially erroneous text as shown at 658, 660 and 662. Here, a CA can simply touch a suggested correction in a hovering field or use a pointing device such as a mouse controlled cursor to select a presented word to make a correction and replace the erroneous word with the automated text suggested in the hovering field. If a CA instead touches an error, the CA can manually change the word to another word. If a CA does not touch an error or an associated corrected word, the word remains as originally transcribed by the CA. An “Accept All” icon is presented at 669 that can be selected to accept all of the suggestions presented on a CA's display. All corrected words are transmitted to an AU's device to be displayed.

Referring to FIG. 21, a method 700 by which a voice engine generates text to be compared to CA generated text and for providing a correction interface as in FIG. 20 for the CA is illustrated. At block 702 the HU's voice messages are provided to a relay. After block 702 control follows two parallel paths to blocks 704 and 716. At block 704 the HU's voice messages are transcribed into text by an automated voice-to-text engine run by the relay server before control passes to block 706. At block 716 a CA transcribes the HU's voice messages to CA generated text. At block 718 the CA generated text is transmitted to the AU's device to be displayed. At block 720 the CA generated text is displayed on the CA's display screen 50 for correction after which control passes to block 706.

Referring still to FIG. 21, at block 706 the relay server compares the CA generated text to the automated text to identify any discrepancies. Where the automated text matches the CA generated text at block 708, control passes back up to block 702 where the process continues. Where the automated text does not match the CA generated text at block 708, control passes to block 710 where the server visually distinguishes the mismatched text on the CA's display screen 50 and also presents suggested correct text (e.g., the automated text). Next, at block 712 the server monitors for any error corrections by the CA and at block 714 if an error has been corrected, the corrected text is transmitted to the AU's device for in-line correction.
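The comparison at blocks 706 and 708 can be sketched as a word-level alignment of the two transcripts. The disclosure does not name a comparison algorithm; the example below swaps in Python's standard difflib.SequenceMatcher purely for illustration, and the function name find_discrepancies and the returned tuple layout are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_discrepancies(ca_text: str, automated_text: str) -> List[Tuple[int, str, str]]:
    """Align the two transcripts by word and list where they differ.

    Each result is (index of the first CA word in the span, the CA words,
    the automated words offered as a suggestion). An empty list means the
    texts match, corresponding to the "match" branch at block 708.
    """
    ca_words = ca_text.split()
    asr_words = automated_text.split()
    matcher = SequenceMatcher(a=ca_words, b=asr_words, autojunk=False)
    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append(
                (i1, " ".join(ca_words[i1:i2]), " ".join(asr_words[j1:j2]))
            )
    return discrepancies
```

Each returned span identifies text that could be visually distinguished on the CA's display at block 710, with the automated words shown as the suggested correction. The CA would still decide whether to accept any suggestion.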

In at least some embodiments the relay server may be able to generate some type of probability or confidence factor related to how likely a discrepancy between automated and CA generated text is related to a CA error and may only indicate errors and present suggestions for probable errors or discrepancies likely to be related to errors. For instance, where an automated text segment is different than an associated CA generated text segment but the automated segment makes no sense contextually in a sentence, the server may not indicate the discrepancy or may not show the automated text segment as an option for correction. The same discrepancy may be shown as a potential error at a different time if the automated segment makes contextual sense.

In still other embodiments automated voice-to-text software that operates at the same time as a CA to generate text may be trained to recognize words often missed by a CA such as articles, for instance, and to ignore other words that CAs more accurately transcribe.

The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.

Thus, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the following appended claims. For example, while the methods above are described as being performed by specific system processors, in at least some cases various method steps may be performed by other system processors. For instance, where a HU's voice is recognized and then a voice model for the recognized HU is employed for voice-to-text transcription, the voice recognition process may be performed by an AU's device and the identified voice may be indicated to a relay 16 which then identifies a related voice model to be used. As another instance, a HU's device may identify a HU's voice and indicate the identity of the HU to the AU's device and/or the relay.

As another example, while the system is described above in the context of a two line captioning system where one line links an AU's device to a HU's device and a second line links the AU's device to a relay, the concepts and features described above may be used in any transcription system including a system where the HU's voice is transmitted directly to a relay and the relay then transmits transcribed text and the HU's voice to the AU's device.

As still one other example, while inputs to an AU's device may include mechanical or virtual on screen buttons/icons, in some embodiments other input arrangements may be supported. For instance, in some cases help or a captioning request may be indicated via a voice input (e.g., a verbal request for assistance or for captioning) or via a gesture of some type (e.g., a specific hand movement in front of a camera or other sensor device that is reserved for commencing captioning).

As another example, in at least some cases where a relay includes first and second differently trained CAs where first CAs are trained to be capable of transcribing and correcting text and second CAs are only trained to be capable of correcting text, a CA may always be on a call but the automated voice-to-text software may aid in the transcription process whenever possible to minimize overall costs. For instance, when a call is initially linked to a relay so that a HU's voice is received at the relay, the HU's voice may be provided to a first CA fully trained to transcribe and correct text. Here, voice-to-text software may train to the HU's voice while the first CA transcribes the text and after the voice-to-text software accuracy exceeds a threshold, instead of completely cutting out the relay or CA, the automated text may be provided to a second CA that is only trained to correct errors. Here, after training the automated text should have minimal errors and therefore even a minimally trained CA should be able to make corrections to the errors in a timely fashion. In other cases, a first CA assigned to a call may only correct errors in automated voice-to-text transcription and a fully trained revoicing and correcting CA may only be assigned after a help or caption request is received.

In other systems an AU's device processor may run automated voice-to-text software to transcribe an HU's voice messages and may also generate a confidence factor for each word in the automated text based on how confident the processor is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, the device processor may link to a relay for more accurate transcription either via more sophisticated automated voice-to-text software or via a CA. The automated process for linking to a relay may be used instead of or in addition to the process described above whereby an AU selects a “caption” button to link to a relay.
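
By way of illustration only, the rolling confidence assessment described above might be sketched as follows. The class name, the method names and the 0.85 threshold are assumptions made for this sketch and are not part of the disclosure; only the 100 word window comes from the example above.

```python
from collections import deque


class ConfidenceMonitor:
    """Hypothetical helper: rolling average of per-word ASR confidence factors."""

    def __init__(self, window_words=100, threshold=0.85):
        # window_words mirrors the "most recent number of words (e.g., 100)" example;
        # the threshold value is an assumed placeholder.
        self.scores = deque(maxlen=window_words)
        self.threshold = threshold

    def add_word(self, confidence):
        # Oldest scores drop out automatically once the window is full.
        self.scores.append(confidence)

    def average(self):
        # None until at least one word has been scored.
        return sum(self.scores) / len(self.scores) if self.scores else None

    def should_link_to_relay(self):
        # Link to a relay only when the rolling average falls below the threshold.
        avg = self.average()
        return avg is not None and avg < self.threshold
```

A time based window (e.g., 45 seconds) could be handled the same way by storing timestamps with the scores and discarding entries older than the window.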

User Customized Complex Words

In addition to storing HU voice models, a system may also store other information that could be used when an AU is communicating with specific HUs to increase accuracy of automated voice-to-text software when used. For instance, a specific HU may routinely use complex words from a specific industry when conversing with an AU. The system software can recognize when a complex word is corrected by a CA or contextually by automated software and can store the word and the pronunciation of the word by the specific HU in an HU word list for subsequent use. Then, when the specific HU subsequently links to the AU's device to communicate with the AU, the stored word list for the HU may be accessed and used to automate transcription. The HU's word list may be stored at a relay, by an AU's device or even by an HU's device where the HU's device has data storing capability.

In other cases a word list specific to an AU's device (i.e., to an AU) that includes complex or common words routinely used to communicate with the AU may be generated, stored and updated by the system. This list may include words used on a regular basis by any HU that communicates with an AU. In at least some cases this list or the HU's word lists may be stored on an internet accessible database (e.g., in the “cloud”) so that the AU or some other person has the ability to access the list(s) and edit words on the list via an internet portal or some other network interface.

Where an HU's complex or hard to spell word list and/or an AU's word list is available, when a CA is creating CA generated text (e.g., via revoicing, typing, etc.), an ASR engine may always operate to search the HU voice signal to recognize when a complex or difficult to spell word is enunciated and the complex or hard to spell words may be automatically presented to the CA via the CA display screen in line with the CA generated text to be considered by the CA. Here, while the CA would still be able to change the automatically generated complex word, it is expected that CA correction of those words would not occur often given the specialized word lists for the specific communicating parties.

Dialect and Other Basis for Specific Transcription Programs

In still other embodiments various aspects of an HU's voice messages may be used to select different voice-to-text software programs that are optimized for voices having different characteristic sets. For instance, there may be different voice-to-text programs optimized for male and female voices or for voices having different dialects. Here, system software may be able to distinguish one dialect from others and select an optimized voice engine/software program to increase transcription accuracy. Similarly, a system may be able to distinguish a high pitched voice from a low pitched voice and select a voice engine accordingly.

In some cases a voice engine may be selected for transcribing an HU's voice based on the region of a country in which an HU's device resides. For instance, where an HU's device is located in the southern part of the United States, an engine optimized for a southern dialect may be used while a device in New England may cause the system to select an engine optimized for another dialect. Different word lists may also be used based on the region of a country in which an HU's device resides.
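
As a non-limiting sketch, region based engine selection could be as simple as a lookup with a general purpose fallback. The region keys and engine identifiers below are hypothetical labels chosen for illustration and do not name any actual ASR product.

```python
# Hypothetical mapping from a device region to a dialect-optimized ASR engine.
ENGINES_BY_REGION = {
    "southern_us": "asr_southern_dialect",
    "new_england": "asr_new_england_dialect",
}
DEFAULT_ENGINE = "asr_general"  # assumed fallback when no regional engine exists


def select_engine(region):
    """Return the engine label for a region, or the general engine if unknown."""
    return ENGINES_BY_REGION.get(region, DEFAULT_ENGINE)
```

The same lookup pattern could select a region specific word list in place of, or in addition to, an engine.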

Indicating/Selecting Caption Source

In at least some cases it is contemplated that an AU's device will provide a text or other indication to an AU to convey how text that appears on an AU device display 18 is being generated. For instance, when automated voice-to-text software (e.g., an automatic speech recognition (ASR) system) is generating text, the phrase “Software Generated Text” may be persistently presented (see 729 in FIG. 22) at the top of a display 18 and when CA generated text is presented, the phrase “CA Generated Text” (not illustrated) may be presented. A phrase “CA Corrected Text” (not illustrated) may be presented when automated text is corrected by a CA.

In some cases a set of virtual buttons (e.g., 68 in FIG. 1) or mechanical buttons may be provided via an AU device allowing an AU to select captioning preferences. For instance, captioning options may include “Automated/Software Generated Text”, “CA Generated Text” (see virtual selection button 719 in FIG. 22) and “CA Corrected Text” (see virtual selection button 721 in FIG. 22). This feature allows an AU to preemptively select a preference in specific cases or to select a preference dynamically during an ongoing call. For example, where an AU knows from past experience that calls with a specific HU result in excessive automated text errors, the AU could select “CA Generated Text” to cause CA support to persist for the duration of a call with the specific HU.

Caption Confidence Indication

In at least some embodiments, automated voice-to-text accuracy may be tracked by a system and indicated to any one or a subset of a CA, an AU, and an HU either during CA text generation or during automated text presentation, or both. Here, the accuracy value may be over the duration of an ongoing call or over a short most recent rolling period or number of words (e.g., last 30 seconds, last 100 words, etc.), or for a most recent HU turn at talking. In some cases two averages, one over a full call period and the other over a most recent period, may be indicated. The accuracy values would be provided via the AU device display 18 (see 728 in FIG. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in FIG. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to enunciate words more accurately or slowly to improve the value.

Non-Text Communication Enhancements

Human communication has many different components and the meanings ascribed to text words are only one aspect of that communication. One other aspect of human non-text communication includes how words are enunciated, which often reveals a speaker's emotions or other meaning. For instance, a simple change in volume while words are being spoken is often intended to convey a different level of importance. Similarly, the duration over which a word is expressed, the tone or pitch used when a phrase is enunciated, etc., can convey a different meaning. For instance, enunciating the word “Yes” quickly can connote a different meaning than enunciating the word “Yes” very slowly or such that the “s” sound carries on for a period of a few seconds. A simple text word representation is devoid of a lot of meaning in an originally spoken phrase in many cases.

In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of enunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in FIG. 22 where an arrow effect 732 represents a long enunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide distinct visual cues along with the transcribed text.

The visual cues may be automatically provided with or used to distinguish text presented via an AU device display regardless of the source of the text. For example, in some cases automated text may be supplemented with visual cues to indicate other communication characteristics and in at least some cases even CA generated text may be supplemented with automatically generated visual cues indicating how an HU enunciates various words and phrases. Here, as voice characteristics are detected for an HU's utterances, software tracks the voice characteristics in time and associates those characteristics with specific text words or phrases generated by the CA. Then, the visual cues for each voice characteristic are used to visually distinguish the associated words when presented to the AU.

In at least some cases an AU may be able to adjust the degree to which text is enhanced via visual cues or even to select preferred visual cues for different automatically identified voice characteristics. For instance, a specific AU may find fully enabled visual cueing to be distracting and instead may only want bold capital letter visual cueing when an HU's volume level exceeds some threshold value. AU device preferences may be set via a display 18 during some type of device commissioning process.

In some embodiments it is contemplated that the automated software that identifies voice characteristics will adjust or train to an HU's voice during the first few seconds of a call and will continue to train to that voice so that voice characteristic identification is normalized to the HU's specific voice signal to avoid excessive visual cueing. Here, it has been recognized that some people's voices will have persistent voice characteristics that would normally be detected as anomalies if compared to a voice standard (e.g., a typical male or female voice). For instance, a first HU may always speak loudly and therefore, if his voice signal was compared to an average HU volume level, the voice signal would exceed the average level most if not all the time. Here, to avoid always distinguishing the first HU's voice signal with visual cueing indicating a loud voice, the software would use the HU voice signal to determine that the first HU's voice signal is persistently loud and would normalize to the loud signal so that words uttered within a range of volumes near the persistent loud volume would not be distinguished as loud. Here, if the first HU's voice signal exceeds the range about his persistent volume level, the exceptionally loud signal may be recognized as a clear deviation from the persistent volume level for the normalized voice and therefore distinguished with a visual cue for the AU when associated text is presented. The voice characteristic recognizing software would automatically train to the persistent voice characteristics for each HU including, for instance, pitch, tone, speed of enunciation, etc., so that persistent voice characteristics of specific HU voice signals are not visually distinguished as anomalies.
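
One possible way to normalize to a persistent volume level, offered only as a sketch, is to track a slowly adapting baseline and flag a word only when it exceeds a band above that baseline. The exponential moving average, the 6 dB band, the adaptation rate and the warm up word count are all assumptions made for this example; the disclosure does not specify a particular normalization technique.

```python
class VolumeNormalizer:
    """Hypothetical sketch: flag words that are loud relative to this HU's own baseline."""

    def __init__(self, band_db=6.0, alpha=0.05, warmup_words=20):
        self.baseline = None          # persistent volume level learned for this HU
        self.band_db = band_db        # assumed range about the persistent level
        self.alpha = alpha            # assumed adaptation rate of the moving average
        self.warmup_words = warmup_words  # words used for initial training, never flagged
        self.seen = 0

    def observe(self, word_volume_db):
        """Update the baseline and return True if the word should get a 'loud' visual cue."""
        self.seen += 1
        if self.baseline is None:
            self.baseline = word_volume_db
            return False
        flagged = (self.seen > self.warmup_words
                   and word_volume_db > self.baseline + self.band_db)
        # Continue training to the voice so the baseline follows persistent changes.
        self.baseline += self.alpha * (word_volume_db - self.baseline)
        return flagged
```

Under this sketch a persistently loud talker is never flagged for speaking at his usual level, while a word well above that level is. Pitch, tone and speed could each be tracked with a separate instance of the same pattern.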

In at least some cases, as in the case of voice models developed and stored for specific HUs, it is contemplated that HU voice characteristic models may also be automatically developed and stored for specific HUs for specifying voice characteristics. For instance, in the above example where a first HU has a particularly loud persistent voice, the volume range about the first HU's persistent volume as well as other persistent characteristics may be determined once during an initial call with an AU and then stored along with a phone number or other HU identifying information in a system database. Here, the next time the first HU communicates with an AU via the system, the HU voice characteristic model would be automatically accessed and used to detect voice characteristic anomalies and to visually distinguish accordingly.

Referring again to FIG. 22, in addition to changing the appearance of transcribed text to indicate enunciation qualities or characteristics, other visual cues may be presented. For instance, if an HU persistently talks in a volume that is much higher than typical for the HU, a volume indicator 717 may be presented or visually altered in some fashion to indicate the persistent volume. As another example, a volume indicator 715 may be presented above or otherwise spatially proximate any word enunciated with an unusually high volume. In some cases the distinguishing visual cue for a specially enunciated word may only persist for a short duration (e.g., 3 seconds, until the end of a related sentence or phrase, for the next 5 words of an utterance, etc.) and then be eliminated. Here, the idea is that the visual cueing is supposed to mimic the effect of an enunciated word or phrase which does not persist long term (e.g., the loud effect of a high volume word only persists as the word is being enunciated).

The software used to generate the HU voice characteristic models and/or to detect voice anomalies to be visually distinguished may be run via any of an HU device processor, an AU device processor, a relay processor and a third party operated processor linkable via the internet or some other network. In at least some cases it will be optimal for an HU device to develop the HU model for an HU that is associated with the device and to store the model and apply the model to the HU's voice to detect anomalies to be visually distinguished for several reasons. In this regard, a particularly rich acoustic HU voice signal is available at the HU device so that anomalies can be better identified in many cases by the HU device as opposed to some processor downstream in the captioning process.

Sharing Text with HU

Referring again to FIG. 24, in at least some embodiments where an HU device 800 includes a display screen 801, an HU voice text transcription 804 may also be presented via the HU device. Here, an HU viewing the transcribed text could formulate an independent impression of transcription accuracy and whether or not a more robust transcription process (e.g., CA generation of text) is required or would be preferred. In at least some cases a virtual “CA request” button 806 or the like may be provided on the HU screen for selection so that the HU has the ability to initiate CA text transcription and/or CA correction of text. Here, an HU device may also allow an HU to switch back to automated text if an accuracy value 802 exceeds some threshold level. Where HU voice characteristics are detected, those characteristics may be used to visually distinguish text at 804 in at least some embodiments.

Captioning Via HU's Device

Where an HU device is a smart phone, a tablet computing device or some other similar device capable of downloading software applications from an application store, it is contemplated that a captioning application may be obtained from an application store for communication with one or more AU devices 12. For instance, the son or daughter of an AU may download the captioning application to be used any time the device user communicates with the AU. Here, the captioning application may have any of the functionality described in this disclosure and may result in a much better overall system in various ways.

For instance, a captioning application on an HU device may run automated voice-to-text software on a digital HU voice signal as described above where that text is provided to the AU device 12 for display and, at times, to a relay for correction, voice model training, voice characteristic model training, etc. As another instance, an HU device may train a voice model for an HU any time an HU's voice signal is obtained regardless of whether or not the HU is participating in a call with an AU. For example, if a dictation application on an HU device which is completely separate from a captioning application is used to dictate a letter, the HU voice signal during dictation may be used to train a general HU voice model for the HU and, more specifically, a general model that can be used subsequently by the captioning system or application. Similarly, an HU voice signal captured during entry of a search phrase into a browser or an address into mapping software which is independent of the captioning application may be used to further train the general voice model for the HU. Here, the general voice model may be extremely accurate even before being used by an AU captioning application. In addition, an accuracy value for an HU's voice model may be calculated prior to an initial AU communication so that, if the accuracy value exceeds a high or required accuracy standard, automated text transcription may be used for an HU-AU call without requiring CA assistance, at least initially.

For instance, prior to an initial AU call, an HU device processor training to an HU voice signal may assign confidence factors to text words automatically transcribed by an ASR engine from HU voice signals. As the software trains to the HU voice, the confidence factor values would continue to increase and eventually should exceed some threshold level at which initial captioning during an AU communication would meet accuracy requirements set by the captioning industry.

As another instance, an HU voice model stored by or accessible by the HU device can be used to automatically transcribe text for any AU device without requiring continual redevelopment or teaching of the HU voice model. Thus, one HU device may be used to communicate with two separate hearing impaired persons using two different AU devices without each sub-system redeveloping the HU voice model.

As yet another instance, an HU's smart phone or tablet device running a captioning application may link directly to each of a relay and an AU's device to provide one or more of the HU voice signal, automated text and/or an HU voice model or voice characteristic model to each. This may be accomplished through two separate phone lines or via two channels on a single cellular line or via any other combination of two communication links.

In some cases an HU voice model may be generated by a relay or an AU's device or some other entity (e.g., a third party ASR engine provider) over time and the HU voice model may then be stored on the HU device or rendered accessible via that device for subsequent transcription. In this case, one robust HU voice model may be developed for an HU by any system processor or server independent of the HU device and may then be used with any AU device and relay for captioning purposes.

Assessing/Indicating Communication Characteristics

In still other cases, at least one system processor may monitor and assess line and/or audio conditions associated with a call and may present some type of indication to each or a subset of an AU, an HU and a CA to help each or at least one of the parties involved in a call to assess communication quality. For instance, an HU device may be able to indicate to an AU and a CA if the HU device is being used as a speaker phone which could help explain an excessive error rate and help with a decision related to CA captioning involvement. As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some compensatory or corrective function. For example, one function may be to provide a signal to the HU indicating that the noise level is high. Another function may be to provide a noise level signal to the CA or the AU which could be indicated on one or both of the displays 50 and 18. Yet another function would be to offer one or more captioning options to any of the HU or AU or even to a text correcting CA when the noise level exceeds the threshold level. Here, the idea is that as the noise level increases, the likelihood of accurate ASR captioning will typically decrease and therefore more accurate and robust captioning options should be available.

As another instance, an HU device may transmit a known signal to an AU device which returns the known signal to the HU device and the HU device may compare the received signal to the known signal to determine line or communication link quality. Here, the HU device may present a line quality value as shown at 808 in FIG. 24 for the HU to consider. Similarly, an AU device may generate a line quality value in a similar fashion and may present the line quality value (not illustrated) to the AU to be considered.

In some cases system devices may monitor a plurality of different system operating characteristics such as line quality, speaker phone use, non-voice noise level, voice volume level, voice signal pace, etc., and may present one or more “coaching” indications to any one of or a subset of the HU, CA and AU for consideration. Here, the coaching indications should help the parties to a call understand if there is something they can do to increase the level of captioning accuracy. Here, in at least some cases only the most impactful coaching indications may be presented and different entities may receive different coaching indications. For instance, where noise at an HU location exceeds a threshold level, a noise indicating signal may only be presented to the HU. Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can affect or that may be the cause of or may lead to poor captioning results for the AU.
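
The noise and line quality example above can be sketched as a per-party selection of the single most impactful active indication. The priority ordering, the audience table and the condition labels are assumptions made for illustration only.

```python
# Assumed ordering from most to least impactful condition.
PRIORITY = ["hu_noise", "line_quality"]

# Assumed table of which parties can usefully act on each condition.
AUDIENCE = {
    "hu_noise": {"HU"},
    "line_quality": {"HU", "AU"},
}


def coaching_indications(active_conditions):
    """Return {party: condition}, giving each party only its top-priority indication."""
    result = {}
    for party in ("HU", "AU", "CA"):
        for condition in PRIORITY:
            if condition in active_conditions and party in AUDIENCE[condition]:
                result[party] = condition
                break  # present only the most impactful indication to this party
    return result
```

With both conditions active, the HU sees only the noise indication and the AU sees the line quality indication. Once the noise condition clears, the HU's indication is replaced by line quality, matching the sequence described above.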

In some cases coaching may include generating a haptic feedback or audible signal or both and a text message for an HU and/or an AU. To this end, while AUs routinely look at their devices to see captions during a caption assisted call, many HUs do not look at their devices during a call and simply rely on audio during communication. In the case of an AU, in some cases even when captioning is presented to an AU the AU may look away from their device display at times when their hearing is sufficient. By providing a haptic or audible signal or both, a user's attention can be drawn to their device display where a warning or call state text message may present more information such as, for instance, an instruction to “Speak louder” or “Move to a less noisy space”, for consideration.

Text Lag Constraints

In some embodiments an AU may be able to set a maximum text lag time such that automated text generated by an ASR engine is used to drive an AU device screen 18 when a CA generated text lag reaches the maximum value. For instance, an AU may not want text to lag behind a broadcast HU voice signal by more than 7 seconds and may be willing to accept a greater error rate to stay within the maximum lag time period. Here, CA captioning/correction may proceed until the maximum lag time occurs at which point automated text may be used to fill in the lag period up to a current HU voice signal on the AU device and the CA may be skipped ahead to the current HU signal automatically to continue the captioning process. Again, here, any automated fill in text or text not corrected by a CA may be visually distinguished on the AU device display as well as on the CA display for consideration.

It has been recognized that many AUs using text to understand a broadcast HU voice signal prefer that the text lag behind the voice signal at least some short amount of time. For instance, an AU talking to an HU may stare off into space while listening to the HU voice signal and, only when a word or phrase is not understood, may look to text on display 18 for clarification. Here, if text were to appear on a display 18 immediately upon audio broadcast to an AU, the text may be several words beyond the misunderstood word by the time the AU looks at the display so that the AU would be required to hunt for the word. For this reason, in at least some embodiments, a short minimum text delay may be implemented prior to presenting text on display 18. Thus, all text would be delayed at least 2 seconds in some cases and perhaps longer where a text generation lag time exceeds the minimum lag value. As with other operating parameters, in at least some cases an AU may be able to adjust the minimum voice-to-text lag time to meet a personal preference.
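
The maximum and minimum lag constraints can be expressed, as a sketch only, with two small functions. The function names and return labels are hypothetical; the 7 second and 2 second defaults are taken from the examples above.

```python
def caption_source(ca_lag_s, max_lag_s=7.0):
    """Pick the text source: CA text until the CA lag reaches the AU's maximum."""
    # "asr_fill" stands for automated text filling in up to the current HU voice signal.
    return "asr_fill" if ca_lag_s >= max_lag_s else "ca"


def display_time(spoken_at_s, text_ready_at_s, min_delay_s=2.0):
    """Earliest time text may appear: never sooner than min_delay_s after the audio."""
    return max(text_ready_at_s, spoken_at_s + min_delay_s)
```

For example, text that is ready 0.5 seconds after a word is spoken would be held until the 2 second mark, while text that takes 5 seconds to generate would be shown as soon as it is available.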

It has been recognized that in cases where transcription switches automatically from a CA to an ASR engine when text lag exceeds some maximum lag time, it will be useful to dynamically change the threshold period as a function of how a communication between an HU and an AU is progressing. For instance, periods of silence in an HU voice signal may be used to automatically adjust the maximum lag period. For example, in some cases if silence is detected in an HU voice signal for more than three seconds, the threshold period to change from CA text to automatic text generation may be shortened to reflect the fact that when the HU starts speaking again, the CA should be closer to a caught up state. Then, as the HU speaks continuously for a period, the threshold period may again be extended. The threshold period prior to automatic transition to the ASR engine to reduce or eliminate text lag may be dynamically changed based on other operating parameters. For instance, rate of error correction by a CA, confidence factor average in ASR text, line quality, noise accompanying the HU voice signal, or any combination of these and other factors may be used to change the threshold period.
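
A minimal sketch of one such dynamic adjustment follows. Only the three second silence trigger comes from the text above; the step size, floor, base value and continuous speech trigger are assumed values, and other factors such as CA correction rate or line quality are omitted for brevity.

```python
def adjust_max_lag(current_s, silence_s, continuous_speech_s,
                   base_s=7.0, floor_s=3.0, step_s=1.0,
                   silence_trigger_s=3.0, speech_trigger_s=10.0):
    """Return an updated CA-to-ASR switch threshold in seconds (hypothetical policy)."""
    if silence_s > silence_trigger_s:
        # After a long pause the CA should be nearly caught up, so tolerate less lag.
        return max(floor_s, current_s - step_s)
    if continuous_speech_s > speech_trigger_s:
        # During sustained speech, relax the threshold back toward its base value.
        return min(base_s, current_s + step_s)
    return current_s
```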

One aspect described above relates to an ASR engine recognizing specific or important phrases like questions (e.g., see phrase “Are you still there?” in FIG. 18) prior to CA text generation and presenting those phrases immediately to an AU upon detection. Other important phrases may include phrases, words or sound anomalies that typically signify “turn markers” (e.g., words or sounds often associated with a change in speaker from AU to HU or vice versa). For instance, if an HU utters the phrase “What do you think?” followed by silence, the combination including the silent period may be recognized as a turn marker and the phrase may be presented immediately with space markers (e.g., underlined spaces) between CA text and the phrase to be filled in by the CA text transcription once the CA catches up to the turn marker phrase.

To this end, see the text at 731 in FIG. 22 where CA generated text is shown at 733 with a lag time indicated by underlined spaces at 735 and an ASR recognized turn marker phrase presented at 737. In this type of system, in some cases the ASR engine will be programmed with a small set (e.g., 100-300) of common turn marker phrases that are specifically sought in an HU voice signal and that are immediately presented to the AU when detected. In some cases, non-text voice characteristics like the change in sound that occurs at the end of a question which is often the signal for a turn marker may be sought in an HU voice signal and any ASR generated text within some prior period (e.g., 5 seconds, the previous 8 words, etc.) may be automatically presented to an AU.
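
Turn marker detection against a small phrase set might be sketched as below. The two phrases are the examples given in this disclosure; a working set would hold on the order of 100-300 phrases. The normalization step and the one second minimum silence are assumptions made for this sketch.

```python
import re

# Example turn marker phrases taken from the text; a real set would be larger.
TURN_MARKERS = {"what do you think", "are you still there"}


def normalize(text):
    """Lowercase and strip punctuation so ASR text can be compared to the phrase set."""
    return re.sub(r"[^a-z' ]", "", text.lower()).strip()


def find_turn_marker(asr_phrase, following_silence_s, min_silence_s=1.0):
    """Return the matched turn marker phrase, or None if no turn marker is detected."""
    if following_silence_s < min_silence_s:
        return None  # the phrase must be followed by silence to count as a turn marker
    phrase = normalize(asr_phrase)
    for marker in TURN_MARKERS:
        if phrase.endswith(marker):
            return marker
    return None
```

A detected phrase would then be presented immediately to the AU, with space markers left for the CA text that has yet to catch up.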

Automatic Voice Signal Routing Based on Call Type

It has been recognized that some types of calls can almost always be accurately handled by an ASR engine. For instance, auto-attendant type calls can typically be transcribed accurately via an ASR. For this reason, in at least some embodiments, it is envisioned that a system processor at the AU device or at the relay may be able to determine a call type (e.g., auto-attendant or not, or some other call type routinely accurately handled by an ASR engine) and automatically route calls within the overall system to the best and most efficient/effective option for text generation. Thus, for example, in a case where an AU device manages access to an ASR operated by a third party and accessible via an internet link, when an AU places a call that is received by an auto-attendant system, the AU device may automatically recognize the answering system as an auto-attendant type and instead of transmitting the auto-attendant voice signal to a relay for CA transcription, may transmit the auto-attendant voice signal to the third party ASR engine for text generation.

In this example, if the call type changes mid-stream during its duration, the AU device may also transmit the received voice signal to a CA for captioning if appropriate. For instance, if an interactive voice recognition auto-attendant system eventually routes the AU's call to a live person (e.g., a service representative for a company), once the live person answers the call, the AU device processor may recognize the person's voice as a non-auto-attendant signal and route that signal to a CA for captioning as well as to the ASR for voice model training. In these cases, the ASR engine may be specially tuned to transcribe auto-attendant voice signals to text and, when a live HU gets on the line, would immediately start training a voice model for that HU's voice signal.
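
The routing decision in the two preceding paragraphs can be sketched as a function that is re-evaluated whenever the detected call type changes. The enum members and destination labels are hypothetical names used only for illustration; how the call type is actually detected is outside the scope of this sketch.

```python
from enum import Enum


class CallType(Enum):
    AUTO_ATTENDANT = "auto_attendant"
    LIVE_PERSON = "live_person"


def route_voice(call_type):
    """Return the destinations for the received voice signal (labels are assumed)."""
    if call_type is CallType.AUTO_ATTENDANT:
        # Auto-attendant audio goes straight to the third party ASR engine.
        return ["third_party_asr"]
    # A live person is captioned by a CA while the ASR trains a voice model.
    return ["relay_ca", "asr_voice_model_training"]
```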

Synchronizing Voice and Text for Playback

In cases or at times when HU voice signals are transcribed automatically to text via an ASR engine when a CA is only correcting ASR generated text, the relay may include a synchronizing function or capability so that, as a CA listens to an HU's voice signal during an error correction process, the associated text from the ASR is presented generally synchronously to the CA with the HU voice signal. For instance, in some cases an ASR transcribed word may be visually presented via a CA display 50 at substantially the same instant at which the word is broadcast to the CA to hear. As another instance, the ASR transcribed word may be presented one, two, or more seconds prior to broadcast of that word to the CA.

In still other cases, the ASR generated text may be presented for correction via a CA display 50 immediately upon generation and, as the CA controls broadcast speed of the HU voice signal for correction purposes, the word or phrase instantaneously audibly broadcast may be highlighted or visually distinguished in some fashion. To this end, see FIG. 23 where automated ASR generated text is shown at 748 where a word instantaneously audibly broadcast to a CA (see 752) is simultaneously highlighted at 750. Here, as the words are broadcast via CA headset 54, the text representations of the words are highlighted or otherwise visually distinguished to help the error correcting CA follow along. Here, highlighting may be linked to the start time of a word being broadcast, to the end time of the word being broadcast, or in any other way to the start or end time of the word. For instance, in some cases a word may be highlighted one second prior to broadcast of the word and may remain highlighted for one second subsequent to the end time of the broadcast so that several words are typically highlighted at a time generally around a currently audibly broadcast word.
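
The one second lead and trail highlighting example can be sketched as a window test against each word's broadcast start and end times. The tuple layout of the word list is an assumption made for this example.

```python
def highlighted_words(words, playback_s, lead_s=1.0, trail_s=1.0):
    """Return the words to highlight at the current playback position.

    words is a list of (text, start_s, end_s) tuples giving each word's
    broadcast start and end times (an assumed data layout).
    """
    return [text for (text, start_s, end_s) in words
            if start_s - lead_s <= playback_s <= end_s + trail_s]
```

Because each word's window extends before and after its own broadcast, several adjacent words are typically highlighted together around the word currently being heard.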

As another example, see FIG. 23A where ASR generated text is shown at 748A. Here, a word 752A instantaneously broadcast to a CA via headset 54 is highlighted at 750A. In this case, however, ASR text scrolls up as words are audibly broadcast to the CA so that a line of text including an instantaneously broadcast word is always generally located at the same vertical height on the display screen 50 (e.g., just above a horizontal center line in the exemplary embodiment in FIG. 23A). Here, by scrolling the text up, unless correcting text in a different line, the CA can simply focus on the one line of text presented in stationary field 753 and specifically the highlighted word at 750A to focus on the word audibly broadcast. In other cases it is contemplated that the highlight at 750A may in fact be a stationary word field and that even the line of text in field 753 may scroll from right to left so that the instantaneously broadcast word will be located in a stationary word field generally near the center of the screen 50. In this way the CA may be able to simply concentrate on one screen location to view the broadcast word.

Referring still to FIG. 23A, a selectable button 751 (hereinafter a “caption source switch button” unless indicated otherwise) allows a CA to manually switch from the ASR text generation to full CA assistance where the CA generates text and corrects that text instead of starting with ASR generated text. In addition, a “seconds behind” field 755 is presented proximate the highlighted broadcast word 750A so that the CA has ready access to that field to ascertain how far behind the CA is in terms of listening to the HU voice message for correction. In addition, an HU silent field 757 is presented that indicates a duration of time between HU voice message segments during which the HU remains silent (e.g., does not speak). Here, in some cases the HU may simply pause to allow the AU to respond and that pause would be considered silence.

Referring still to FIG. 23A, field 755 indicates that the audible broadcast is only 12.2 seconds behind despite the illustrated 20 seconds of HU silence at 757 and many ASR words that follow the instantaneously broadcast word at 750A. Here, a system processor accounts for the 20 seconds of HU silence when calculating the seconds behind value as the system can remove that silent period from CA consideration so that the CA can catch up more quickly. Thus, in the FIG. 23A example, the duration of time between when an HU actually uttered the words “restaurant” at 750A and “not” at 759 may be 32.2 seconds but the system can recognize that the HU was silent during 20 of those seconds so that the seconds behind calculation may be 12.2 seconds as shown.
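The seconds behind calculation in the FIG. 23A example can be sketched as a subtraction of HU silence from the elapsed time. The function name, argument names and the representation of silence as (start, end) intervals are assumptions made for illustration only.

```python
def seconds_behind(broadcast_word_time, latest_asr_word_time, silent_intervals):
    """Lag between the broadcast word and the most recent ASR word,
    excluding periods during which the HU was silent.

    All times are in seconds on the HU voice signal timeline.
    `silent_intervals` is a list of (start, end) tuples (hypothetical format).
    """
    elapsed = latest_asr_word_time - broadcast_word_time
    silence = 0.0
    for start, end in silent_intervals:
        # Count only the part of each silent interval inside the lag window.
        overlap = min(end, latest_asr_word_time) - max(start, broadcast_word_time)
        if overlap > 0:
            silence += overlap
    return elapsed - silence
```

For the example above, 32.2 seconds of elapsed time containing 20 seconds of HU silence yields a value of approximately 12.2 seconds.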

In at least some cases when the seconds behind delay exceeds some threshold value, the system may automatically indicate that condition as a warning or alert to the CA. For instance, assume that the threshold delay is four seconds. Here, when the seconds behind value exceeds four seconds, in at least some cases, the seconds behind field may be highlighted or otherwise visually distinguished as an alert. In FIG. 23A, field 755 is shown as left down to right cross hatched to indicate the color red as an alert because the four second delay threshold is exceeded.

In at least some cases it is contemplated that more sophisticated algorithms may be implemented for determining when to alert the CA to a circumstance where the seconds behind period becomes problematic. For instance, where a seconds behind duration is 12.2 seconds as in FIG. 23A, that magnitude of duration may not warrant an alert if confidence factors associated with ASR generated text thereafter are all extremely high as accurate ASR text thereafter should enable the CA to catch up relatively quickly to reduce the seconds behind period rapidly. For instance, where ASR text confidence factors are high, the system may automatically double the broadcast rate of the HU voice signal so that the 12.2 second delay can be worked to a zero value in half that time.

As another instance, because HUs speak at different rates at different times, rate of HU speaking or density of words spoken during a time segment may be used to qualify the delay between a broadcast word and a most recent ASR word generated. For instance, assume a 15 second delay between when a word is broadcast to a CA and the time associated with the most recent ASR generated text. Here, in some cases an HU may utter 3 words during the 15 second period while in other cases the HU may have uttered 30 words during that same period. Clearly, the time required for a CA to work the 15 second delay downward is a function of the density of words uttered by the HU in the intervening time. Here, whether or not to issue the alert would be a function of word density during the delay period.

As yet one other instance, instead of assessing delay by a duration of time, the delay may be based on a number of words between a most recently generated ASR word and the word that is currently being considered by a CA (e.g., the most current word in an HU voice signal considered by the CA). Here, an alert may be issued to the CA when the CA is a threshold number of words behind the most recent ASR generated word. For example, the threshold may be 12 words.
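A word-count based alert of this kind can be sketched as below. The function names and the use of word indexes into the ASR transcript are assumptions; only the 12 word threshold comes from the example above.

```python
def words_behind(latest_asr_index, ca_index):
    """Number of ASR words generated after the word the CA is considering.

    Both arguments are zero-based positions in the ASR transcript
    (a hypothetical representation).
    """
    return max(0, latest_asr_index - ca_index)


def should_alert(latest_asr_index, ca_index, threshold_words=12):
    """True when the CA is at least `threshold_words` behind the newest ASR word."""
    return words_behind(latest_asr_index, ca_index) >= threshold_words
```

Counting words rather than seconds automatically accounts for HU speaking rate, since a slow talker produces few words over a long delay while a fast talker produces many.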

Many other factors may be used to determine when to issue CA delay alerts. For instance, a CA's metrics related to specific HU voice characteristics, voice signal quality factors, etc., may each be used separately or in combination with other factors to assess when an alert is prudent.

In addition to affecting when to issue a delay alert to a user, the above factors may be used to alter the seconds behind value in field 755 to reflect an anticipated duration of time required by a specific CA to catch up to the most recently generated ASR text. For instance, in FIG. 23A if, based on one or more of the above factors, the system anticipates that it will take the CA 5 seconds to catch up on the 12.2 second delay, the seconds behind value may be 5.0 seconds as opposed to 12.2 (e.g., in a case where the system speeds up the rate of HU voice signal broadcast through high confidence ASR text).

In at least some cases an error correcting CA will be able to skip back and forth within the HU voice signal to control broadcast of the HU voice signal to the CA. For instance, as described above, a CA may have a foot pedal or other control interface device useable to skip back in a buffered HU voice recording 5, 10, etc., seconds to replay an HU voice signal recording. Here, when the recording skips back, the highlighted text in representation 748 would likewise skip back to be synchronized with the broadcast words. To this end, see FIG. 25 where, in at least some cases, a foot pedal activation or other CA input may cause the recording to skip back to the word “pizza” which is then broadcast as at 764 and highlighted in text 748 as shown at 762. In other cases, the CA may simply single tap or otherwise select any word presented on display 50 to skip the voice signal play back and highlighted text to that word. For instance, in FIG. 25 icon 766 represents a single tap which causes the word “pizza” to be highlighted and substantially simultaneously broadcast. Other word selecting gestures (e.g., a mouse control click, etc.) are contemplated.

In some embodiments when a CA selects a text word to correct, the voice signal replay may automatically skip to some word in the voice buffer relative to the selected word and may halt voice signal replay automatically until the correction has been completed. For instance, a double tap on the word “pals” in FIG. 23 may cause that word to be highlighted for correction and may automatically cause the point in the HU voice replay to move backward to a location a few words prior to the selected word “pals.” To this end, see in FIG. 25 that the word “Pete's” that is still highlighted as being corrected (e.g., the CA has not confirmed a complete correction) has been typed in to replace the word “Pals” and the word “pizza” that precedes the word “Pete's” has been highlighted to indicate where the HU voice signal broadcast will again commence after the correction at 760 has been completed. While backward replay skipping has been described, forward skipping is also contemplated.

In some cases, when a CA selects a word in presented text for correction or at least to be considered for correction, the system may skip to a location a few words prior to the selected word and may re-present the HU voice signal starting at that point and ending a few words after that point to give a CA context in which to hear the word to be corrected. Thereafter, the system may automatically move back to a subsequent point in the HU voice signal at which the CA was when the word to be corrected was selected. For instance, again, in FIG. 25, assume that the HU voice broadcast to a CA is at the word “catch” 761 when the CA selects the word “Pete's” 760 for correction. In this case, the CA's interface may skip back in the HU voice signal to the word “pizza” at 762 and re-broadcast the phrase parts from the word “pizza” to the word “want” 763 to provide immediate context to the CA. After broadcasting the word “want”, the interface would skip back to the word “catch” 761 and continue broadcasting the HU voice signal from that point on.
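The context replay behavior can be sketched as a small planning function that returns where to start the replay, where to end it, and where to resume afterward. The function name, the word-index representation and the default of one context word on each side are assumptions for illustration.

```python
def context_replay_plan(selected_index, current_index, total_words, before=1, after=1):
    """Plan a short context replay around a word selected for correction.

    Returns (replay_start, replay_end, resume_index) as zero-based word
    positions: replay from `before` words ahead of the selected word to
    `after` words past it, then resume where the CA was listening.
    """
    replay_start = max(0, selected_index - before)
    replay_end = min(total_words - 1, selected_index + after)
    return replay_start, replay_end, current_index
```

Using a hypothetical word list `["pizza", "Pete's", "want", "to", "catch"]`, selecting “Pete's” while the broadcast is at “catch” gives a replay from “pizza” through “want” followed by a return to “catch”.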

In at least some embodiments where an ASR engine generates automatic text and a CA is simply correcting that text prior to transmission to an AU, the ASR engine may assign a confidence factor to each word generated that indicates how likely it is that the word is accurate. Here, in at least some cases, the relay server may highlight any text on the correcting CA's display screen that has a confidence factor lower than some threshold level to call that text to the attention of the CA for special consideration. To this end, see again FIG. 23 where various words (e.g., 777, 779, 781) are specially highlighted in the automatically generated ASR text to indicate a low confidence factor.
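Low confidence highlighting can be sketched as a simple threshold test on per-word confidence factors. The function name, the (word, confidence) tuple format, the 0 to 1 confidence scale and the 0.8 default threshold are all assumptions; the source only states that some threshold level is used.

```python
def flag_low_confidence(words, threshold=0.8):
    """Pair each word with a flag indicating whether it should be highlighted.

    `words` is a list of (text, confidence) tuples with confidence in [0, 1]
    (a hypothetical ASR output format). A word is flagged when its
    confidence factor is lower than `threshold`.
    """
    return [(text, confidence < threshold) for text, confidence in words]
```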

While AU voice signals are not presented to a CA in most cases for privacy reasons, it is believed that in at least some cases a CA may prefer to have some type of indication when an AU is speaking to help the CA understand how a communication is progressing. To this end, in at least some embodiments an AU device may sense an AU voice signal and at least generate some information about when the AU is speaking. The speaking information, without word content, may then be transmitted in real time to the CA at the relay and used to present an indication that the AU is speaking on the CA screen. For instance, see again FIG. 23 where lines 783 are presented on display 50 to indicate that an AU is speaking. As shown, lines 783 are presented on a right side of the display screen to distinguish the AU's speaking activity from the text and other visual representations associated with the HU's voice signal. As another instance, when the AU speaks, a text notice 797 or some graphical indicator (e.g., a talking head) may be presented on the CA display 50 to indicate current speaking by an AU. While not shown it is contemplated that some type of non-content AU speaking indication like 783 may also be presented to an AU via the AU's device to help the AU understand how the communication is progressing.

Sequential Short Duration Third Party Caption Requests

It has been recognized that some third party ASR systems available via the internet or the like tend to be extremely accurate for short voice signal durations (e.g., 15-30 seconds) after which accuracy becomes less reliable. To deal with ASR accuracy degradation during an ongoing call, in at least some cases where a third party ASR system is employed to generate automated text, the system processor (e.g., at the relay, in the AU device or in the HU device) may be programmed to generate a series of automatic text transcription requests where each request only transmits a short sub-set of a complete HU voice signal. For instance, a first ASR request may be limited to a first 15 seconds of HU voice signal, a second ASR request may be limited to a next 15 seconds of HU voice signal, a third ASR request may be limited to a third 15 seconds of HU voice signal, and so on. Here, each request would present the associated HU signal to the ASR system immediately and continuously as the HU voice signal is received and transcribed text would be received back from the ASR system during the 15 second period. As the text is received back from the ASR system, the text would be cobbled together to provide a complete and relatively accurate transcript of the HU voice signal.

While the HU voice signal may be divided into consecutive periods in some cases, in other cases it is contemplated that the HU voice signal slices or sub-periods sent to the ASR system may overlap at least somewhat to ensure all words uttered by an HU are transcribed and to avoid a case where words in the HU voice signal are split among periods. For instance, voice signal periods may be 30 seconds long and each may overlap a preceding period by 10 seconds and a following period by 10 seconds to avoid split words. In addition to avoiding a split word problem, overlapping HU voice signal periods presented to an ASR system allows the system to use context represented by surrounding words to better (e.g., contextually) convert HU voiced words to text. Thus, a word at the end of a first 20 second voice signal period will be near the front end of the overlapping portion of a next voice signal period and therefore, typically, will have contextual words prior to and following the word in the next voice signal period so that a more accurate contextually considered text representation can be generated.
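The division of an HU voice signal into consecutive or overlapping request periods can be sketched as follows. This sketch assumes a buffered signal of known length, whereas a live call would generate the periods incrementally; the function name and arguments are hypothetical, and no particular third party ASR interface is implied.

```python
def overlapping_windows(total_seconds, window=30.0, overlap=10.0):
    """Split a voice signal of `total_seconds` into request periods.

    Each period is up to `window` seconds long and overlaps the previous
    period by `overlap` seconds. Use overlap=0 for consecutive periods.
    Returns a list of (start, end) tuples in seconds.
    """
    if overlap < 0 or overlap >= window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the final period reached the end of the signal
        start += step
    return windows
```

With the 30 second period and 10 second overlap from the example, a new period begins every 20 seconds, so a word near the 20 second mark of one period also appears in the next period with context on both sides.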

In some cases, a system processor may employ two, three or more independent or differently tuned ASR systems to automatically generate automated text and the processor may then compare the text results and formulate a single best transcript representation in some fashion. For instance, once text is generated by each engine, the processor may poll for most common words or phrases and then select most common as text to provide to an AU, to a CA, to a voice modeling engine, etc.
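Polling several ASR engines for the most common word can be sketched as a position-by-position vote. This sketch makes a strong simplifying assumption that the transcripts are already aligned word for word; real engine outputs would generally need an alignment step first. The function name is hypothetical.

```python
from collections import Counter


def vote_words(transcripts):
    """Pick the most common word at each position across ASR transcripts.

    `transcripts` is a list of word lists, one per ASR engine, assumed to be
    aligned so that position i refers to the same spoken word in each list.
    Positions beyond the shortest transcript are ignored. Ties go to the
    word seen first among the engines.
    """
    voted = []
    for candidates in zip(*transcripts):
        word, _count = Counter(candidates).most_common(1)[0]
        voted.append(word)
    return voted
```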

Default ASR, User Selects Call Assistance

In most cases automated text (e.g., ASR generated text) will be generated much faster than CA generated text or at least consistently much faster. It has been recognized that in at least some cases an AU will prefer even uncorrected automated text to CA corrected text where the automated text is generated and presented more rapidly and is therefore more in sync with an audio broadcast HU voice signal. For this reason, in at least some cases, a different and more complex voice-to-text triage process may be implemented. For instance, when an AU-HU call commences and the AU requires text initially, automated ASR generated text may initially be provided to the AU. If a good HU voice model exists for the HU, the automated text may be provided without CA correction at least initially. If the AU, a system processor, or an HU determines that the automated text includes too many errors or if some other operating characteristic (e.g., line noise) that may affect text transcription accuracy is sensed, a next level of the triage process may link an error correcting CA to the call and the ASR text may be presented in essentially real time to the CA via display 50 simultaneously with presentation to the AU via display 18.

Here, as the CA corrects the automated text, corrections are automatically sent to the AU device and are indicated via display 18. Here, the corrections may be in-line (e.g., erroneous text replaced), above error, shown after errors, may be visually distinguished via highlighting or the like, etc. Here, if too many errors continue to persist from the AU's perspective, the AU may select an AU device button (e.g., see 68 again in FIG. 1) to request full CA transcription. Similarly, if an error correcting CA perceives that the ASR engine is generating too many errors, the error correcting CA may perform some action to initiate full CA transcription and correction. Similarly, a relay processor or even an AU device processor may detect that an error correcting CA is having to correct too many errors in the ASR generated text and may automatically initiate full CA transcription and correction.

In any case where a CA takes over for an ASR engine to generate text, the ASR engine may still operate on the HU voice signal to generate text and use that text and CA generated text, including corrections, to refine a voice model for the HU. At some point, once the voice model accuracy as tested against the CA generated text reaches some threshold level (e.g., 95% accuracy), the system may again automatically or at the command of the transcribing CA or the AU, revert back to the CA corrected ASR text and may cut out the transcribing CA to reduce costs. Here, if the ASR engine eventually reaches a second higher accuracy threshold (e.g., 98% accuracy), the system may again automatically or at the command of an error correcting CA or an AU, revert back to the uncorrected ASR text to further reduce costs.
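The two-threshold reversion described above can be sketched as a mapping from measured ASR accuracy to a captioning mode. The function name and the mode labels are hypothetical; the 95% and 98% defaults are the example values given above, and the sketch covers only the automatic path, not CA or AU commands.

```python
def caption_mode(model_accuracy, corrected_threshold=0.95, uncorrected_threshold=0.98):
    """Choose a captioning mode from ASR accuracy measured against CA text.

    `model_accuracy` is a fraction in [0, 1]. Mode labels are illustrative:
      "full_ca"          - a CA generates and corrects the text
      "asr_ca_corrected" - ASR generates text and a CA corrects it
      "asr_uncorrected"  - ASR text is used without CA correction
    """
    if model_accuracy >= uncorrected_threshold:
        return "asr_uncorrected"
    if model_accuracy >= corrected_threshold:
        return "asr_ca_corrected"
    return "full_ca"
```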

AU Accuracy-Speed Preference Selection

In at least some cases it is contemplated that an AU device may allow an AU to set a personal preference between text transcription accuracy and text speed. For instance, a first AU may have fairly good hearing and therefore may only rely on a text transcript periodically to identify a word uttered by an HU while a second AU has extremely bad hearing and effectively reads every word presented on an AU device display. Here, the first AU may prefer text speed at the expense of some accuracy while the second AU may require accuracy even when speed of text presentation or correction is reduced. An exemplary AU device tool is shown as an accuracy/speed scale 770 in FIG. 18 where an accuracy/speed selection arrow 772 indicates a current selected operating characteristic. Here, moving arrow 772 to the left, operating parameters like correction time, ASR operation, etc., are adjusted to increase accuracy at the expense of speed and moving arrow 772 right on scale 770 increases speed of text generation at the expense of accuracy.

In at least some embodiments when arrow 772 is moved to the right so speed is preferred over greater accuracy, the system may respond to the setting adjustment by opting for automated text generation as opposed to CA text generation. In other cases where a CA may still perform at least some error corrections despite a high speed setting, the system may limit the window of automated text that a CA is able to correct to a small time window trailing a current time. Thus, for instance, instead of allowing a CA to correct the last 30 seconds of automated text, the system may limit the CA to correcting only the most recent 7 seconds of text so that error corrections cannot lag too far behind current HU utterances.
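Limiting CA corrections to a trailing time window can be sketched as a filter on word timestamps. The function name and the (text, time) tuple format are assumptions; the 30 and 7 second windows are the example values from above.

```python
def correctable_words(words, now, window_seconds):
    """Return the words a CA is still allowed to correct.

    `words` is a list of (text, uttered_at) tuples with times in seconds
    (a hypothetical format). Only words uttered within `window_seconds`
    of the current time `now` remain open for correction.
    """
    return [text for text, uttered_at in words if now - uttered_at <= window_seconds]
```

A high speed setting would pass a small window (e.g., 7 seconds) while an accuracy oriented setting would pass a larger one (e.g., 30 seconds).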

Where an AU moves arrow 772 to the left so that speed is sacrificed for greater caption accuracy, the system may delay delivery of even automated text to an AU for some time so that at least some automated error corrections are made prior to delivery of initial text captions to an AU. The delay may even be until a CA has made at least some or even all caption corrections. Other ways of speeding up text generation or increasing accuracy at the expense of speed are contemplated.

Audio-Text Synchronization Adjustment

In at least some embodiments when text is presented to an error correcting CA via a CA display 50, the text may be presented at least slightly prior to (e.g., ¼ to 2 seconds before) broadcast of an associated HU voice signal. In this regard, it has been recognized that many CAs prefer to see text prior to hearing a related audio signal and link the two optimally in their minds when text precedes audio. In other cases specific CAs may prefer simultaneous text and audio and still others may prefer audio before text. In at least some cases it is contemplated that a CA workstation may allow a CA to set text-audio sync preferences. To this end, see exemplary text-audio sync scale 765 in FIG. 25 that includes a sync selection arrow 767 that can be moved along the scale to change text-audio order as well as delay or lag between the two.

In at least some embodiments an on-screen tool akin to scale 765 and arrow 767 may be provided on an AU device display 18 to adjust HU voice signal broadcast and text presentation timing to meet an AU's preferences.

System Options Based on HU's Voice Characteristics

It has been recognized that some AUs can hear voice signals with a specific characteristic set better than other voice signals. For instance, one AU may be able to hear low pitch traditionally male voices better than high pitch traditionally female voice signals. In some embodiments an AU may perform a commissioning procedure whereby the AU tests capability to accurately hear voice signals having different characteristics and results of those capabilities may be stored in a system database. The hearing capability results may then be used to adjust or modify the way text captioning is accomplished. For instance, in the above case where an AU hears low pitch voices well but not high pitch voices, if a low pitch HU voice is detected when a call commences, the system may use the ASR function more rapidly than in the case of a high pitched voice signal. Voice characteristics other than pitch may be used to adjust text transcription and ASR transition protocols in similar ways.

In some cases it is contemplated that an AU device or other system device may be able to condition an incoming HU voice signal so that the signal is optimized for a specific AU's hearing deficiency. For instance, assume that an AU only hears high pitch voices well. In this case, if a high pitch HU voice signal is received at an AU's device, the AU's device may simply broadcast that voice signal to the AU to be heard. However, if a low pitch HU voice signal is received at the AU's device, the AU's device may modify that voice signal to convert it to a high pitch signal prior to broadcast to the AU so that the AU can better hear the broadcast voice. This automatic voice conditioning may be performed regardless of whether or not the system is presenting captioning to an AU.

In at least some cases where an HU device like a smart phone, tablet, computing device, laptop, smart watch, etc., has the ability to store data or to access data via the internet, a WIFI system or otherwise that is stored on a local or remote (e.g., cloud) server, it is contemplated that every HU device or at least a subset used by specific HUs may store an HU voice model for an associated HU to be used by a captioning application or by any software application run by the HU device. Here, the HU model may be trained by one or more applications run on the HU device or by some other application like an ASR system associated with one of the captioning systems described herein that is run by an AU device, the relay server, or some third party server or processor. Here, for example, in one instance, an HU's voice model stored on an HU device may be used to drive a voice-to-text search engine input tool to provide text for an internet search independent of the captioning system. The multi-use and perhaps multi-application trained HU voice model may also be used by a captioning ASR system during an AU-HU call. Here, the voice model may be used by an ASR application run on the HU device, run on the AU device, run by the relay server or run by a third party server.

In cases where an HU voice model is accessible to an ASR engine independent of an HU device, when an AU device is used to place a call to an HU device, an HU model associated with the number called may be automatically prepared for generating captions even prior to connection to the HU device. Where a phone or other identifying number associated with an HU device can be identified prior to an AU answering a call from the HU device, again, an HU voice model associated with the HU device may be accessed and readied by the captioning system for use prior to the answering action to expedite ASR text generation. Most people use one or a small number of phrases when answering an incoming phone call. Where an HU voice model is loaded prior to an HU answering a call, the ASR engine can be poised to detect one of the small number of greeting phrases routinely used to answer calls and to compare the HU's voice signal to the model to confirm that the voice model is for the specific HU that answers the call. If the HU's salutation upon answering the call does not match the voice model, the system may automatically link to a CA to start a CA controlled captioning process.

While at least some systems will include HU voice models, it should be appreciated that other systems may not and instead may rely on robust voice to text software algorithms that train to specific voices over relatively short durations so that every new call with an HU causes the system to rapidly train anew to a received HU voice signal. For instance, in many cases a voice model can be at least initially trained within tens of seconds to specific voices after which the models continue to train over the duration of a call to become more accurate as a call proceeds. In at least some of these cases there is no need for voice model storage.

Presenting Captions for AU Voice Messages

While a captioning system must provide accurate text corresponding to an HU voice signal for an AU to view when needed, typical relay systems for deaf and hard of hearing persons would not provide a transcription of an AU's voice signal. Here, generally, the thinking has been that an AU knows what she says in a voice signal and an HU hears that signal and therefore text versions of the AU's voice signal were not necessary. This, coupled with the fact that AU captioning would have substantially increased the transcription burden on CAs (e.g., would have required CA revoicing or typing and correction of more voice signal (e.g., the AU voice signal)) meant that AU voice signal transcription simply was not supported. Another reason AU voice transcription was not supported was that at least some AUs, for privacy reasons, do not want both sides of conversations with HUs being listened to by CAs.

In at least some embodiments, it is contemplated that the AU side of a conversation with an HU may be transcribed to text automatically via an ASR engine and presented to the AU via a device display 18 while the HU side of the conversation is transcribed to text in the most optimal way given transcription triage rules or algorithms as described above. Here, the AU voice captions and AU voice signal would never be presented to a CA. Here, while AU voice signal text may not be necessary in some cases, in others it is contemplated that many AUs may prefer that text of their voice signals be presented to be referred back to or simply as an indication of how the conversation is progressing. Seeing both sides of a conversation helps a viewer follow the progress more naturally. Here, while the ASR generated AU text may not always be extremely accurate, accuracy in the AU text is less important because, again, the AU knows what she said.

Where an ASR engine automatically generates AU text, the ASR engine may be run by any of the system processors or devices described herein. In particularly advantageous systems the ASR engine will be run by the AU device 12 where the software that transcribes the AU voice to text is trained to the voice of the AU and therefore is extremely accurate because of the personalized training.

Thus, referring again to FIG. 1, for instance, in at least some embodiments, when an AU-HU call commences, the AU voice signal may be transcribed to text by AU device 12 and presented as shown at 822 in FIG. 26 without providing the AU voice signal to relay 16. The HU voice signal, in addition to being audibly broadcast via AU device 12, may be transmitted in some fashion to relay 16 for conversion to text when some type of CA assistance is required. Accurate HU text is presented on display 18 at 820. Thus, the AU gets to see both AU text, albeit with some errors, and highly accurate HU text. Referring again to FIG. 24, in at least some cases, AU and HU text may also be presented to an HU via an HU device (e.g., a smart phone) in a fashion similar to that shown in FIG. 26.

Referring still to FIG. 26, where both HU and AU text are generated and presented to an AU, the HU and AU text may be presented in staggered columns as shown along with an indication of how each text representation was generated (e.g., see titles at top of each column in FIG. 26).

In at least some cases it is contemplated that an AU may, at times, not even want the HU side of a conversation to be heard by a CA for privacy reasons. Here, in at least some cases, it is contemplated that an AU device may provide a button or other type of selectable activator to indicate that total privacy is required and then to re-establish relay or CA captioning and/or correction again once privacy is no longer required. To this end, see the “Complete Privacy” button or virtual icon 826 shown on the AU device display 18 in FIG. 26. Here, it is contemplated that, while an AU-HU conversation is progressing and a CA generates/corrects text 820 for an HU's voice signal and an ASR generates AU text 822, if the AU wants complete privacy but still wants HU text, the AU would select icon 826. Once icon 826 is selected, the HU voice signal would no longer be broadcast to the CA and instead an ASR engine would transcribe the HU voice signal to automated text to be presented via display 18. Icon 826 in FIG. 26 would be changed to “CA Caption” or something to that effect to allow the AU to again start full CA assistance when privacy is less of a concern.

Other Triggers for Automated Catch Up Text

In addition to a voice-to-text lag exceeding a maximum lag time, there may be other triggers for using ASR engine generated text to catch an AU up to an HU voice signal. For instance, in at least some cases an AU device may monitor for an utterance from an AU using the device and may automatically fill in ASR engine generated text corresponding to an HU voice signal when any AU utterance is identified. Here, for example, where CA transcription is 30 seconds behind an HU voice signal, if an AU speaks, it may be assumed that the AU has been listening to the HU voice signal and is responding to the broadcast HU voice signal in real time. Because the AU responds to the up to date HU voice signal, there may be no need for an accurate text transcription for prior HU voice phrases and therefore automated text may be used to automatically catch up. In this case, the CA's transcription task would simply be moved up in time to a current real time HU voice signal automatically and the CA would not have to consider the intervening 30 seconds of HU voice for transcription or even correction. When the system skips ahead in the HU voice signal broadcast to the CA, the system may present some clear indication to the CA that it is skipping ahead to avoid confusion. For instance, when the system skips ahead, a system processor may present a simultaneous warning on the CA display screen indicating that the system is skipping intervening HU voice signal to catch the CA up to real time.

As another example, when an AU device or other system device recognizes a turn marker in an HU voice signal, all ASR generated text that is associated with a lag time may be filled in immediately and automatically.

As still one other instance, an AU device or other device may monitor AU utterances for some specific word or phrase intended to trigger an update of text associated with a lag time. For instance, the AU device may monitor for the word “Update” and, when identified, may fill in the lag time with automated text. Here, in at least some cases, the AU device may be programmed to cancel the catch-up word “Update” from the AU voice signal sent to the HU device. Thus, here, the AU utterance “Update” would have the effect of causing ASR text to fill in a lag time without being transmitted to the HU device. Other commands may be recognized and automatically removed from the AU voice signal.
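Command word detection and removal can be sketched at the level of recognized words. This is a simplification: an actual AU device would remove the command from the audio signal itself, which is not shown here. The function name, the word-list representation and the punctuation handling are assumptions.

```python
def split_commands(au_words, commands=("update",)):
    """Separate command words from the AU words forwarded to the HU device.

    `au_words` is a list of words recognized in the AU voice signal
    (a hypothetical representation of the signal). Returns a tuple
    (forwarded, triggered): the words to pass on to the HU device and the
    normalized command words that were detected and removed.
    """
    forwarded = []
    triggered = []
    for word in au_words:
        normalized = word.strip(".,!?").lower()
        if normalized in commands:
            triggered.append(normalized)  # acted on locally, not transmitted
        else:
            forwarded.append(word)
    return forwarded, triggered
```

Any non-empty `triggered` list would then cause the lag time to be filled with automated text.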

Thus, it should be appreciated that various embodiments of a semi-automated automatic voice recognition or text transcription system to aid hearing impaired persons when communicating with HUs have been described. In each system there are at least three entities and at least three devices and in some cases there may be a fourth entity and an associated fourth device. In each system there is at least one HU and associated device, one AU and associated device and one relay and associated device or sub-system while in some cases there may also be a third party provider (e.g., a fourth party) of ASR services operating one or more servers that run ASR software. The HU device, at a minimum, enables an HU to enunciate words that are transmitted to an AU device and receives an AU voice signal and broadcasts that signal audibly for the HU to hear.

The AU device, at a minimum, enables an AU to enunciate words that are transmitted to an HU device, receives an HU voice signal and broadcasts that signal (e.g., audibly, via Bluetooth where an AU uses a hearing aid) for the AU to attempt to hear, receives or generates transcribed text corresponding to an HU voice signal and displays the transcribed text to an AU on a display to view.

The relay, at a minimum, at times, receives the HU voice signal and generates at least corrected text that may be transmitted to another system device.

In some cases where there is no fourth party ASR system, any of the other functions/processes described above may be performed by any of the HU device, AU device and relay server. For instance, the HU device in some cases may store an HU voice model and/or voice characteristics model, an ASR application and a software program for managing which text, ASR or CA generated, is used to drive an AU device. Here, the HU device may link directly with each of the AU device and relay, and may operate as an intermediary therebetween.

As another instance, HU models, ASR software and caption control applications may be stored and used by the AU device processor or, alternatively, by the relay server. In still other instances different system components or devices may perform different aspects of a functioning system. For instance, an HU device may store an HU voice model which may be provided to an AU device automatically at the beginning of a call and the AU device may transmit the HU voice model along with a received HU voice signal to a relay that uses the model to tune an ASR engine to generate automated text as well as provides the HU voice signal to a first CA for revoicing to generate CA text and a second CA for correcting the CA text. Here, the relay may transmit transcribed text (e.g., automated and CA generated) to the AU device and the AU device may then select one of the received texts to present via the AU device screen. Here CA captioning and correction and transmission of CA text to the AU device may be halted in total or in part at any time by the relay or, in some cases, by the AU device, based on various parameters or commands received from any parties (e.g., AU, HU, CA) linked to the communication.

In cases where a fourth party to the system operates an ASR engine in the cloud or otherwise, at a minimum, the ASR engine receives an HU voice signal at least some of the time and generates automated text which may or may not be used at times to drive an AU device display.

In some cases it is contemplated that ASR engine text (e.g., automated text) may be presented to an HU while CA generated text is presented to an AU and a most recent word presented to an AU may be indicated in the text on the HU device so that the HU has a good sense of how far behind an AU is in following the HU's voice signal. To this end, see FIG. 27 that shows an exemplary HU smart phone device 800 including a display 801 where text corresponding to an HU voice signal is presented for the HU to view at 848. The text 848 includes text already presented to an AU prior to and including the word “after” that is shown highlighted 850 as well as ASR engine generated text subsequent to the highlight 850 that, in at least the illustrated embodiment, may not have been presented to the AU at the illustrated time. Here, an HU viewing display 801 can see where the AU is in receiving text corresponding to the HU voice signal. The HU may use the information presented as a coaching tool to help the HU regulate the speed at which the HU converses. In addition to indicating the most recent textual word presented to the AU, the most recent word audibly broadcast to the AU may be visually highlighted as shown at 847 as well.

To be clear, where an HU device is a smart phone or some other type of device that can run an application program to participate in a captioning service, many different linking arrangements between the AU, HU and a relay are contemplated. For instance, in some cases the AU and HU may be directly linked and there may be a second link or line from the AU to the relay for voice and data transmission when necessary between those two entities. As another instance, when an HU and AU are linked directly and relay services are required after the initial link, the AU device may cause the HU device to link directly to the relay and the relay may then link to the AU device so that the relay is located between the AU and HU devices and all communications pass through the relay. In still another instance, an HU device may link to the relay and the relay to the AU device and the AU device to the HU device so that any communications, voice or data, between two of the three entities are direct without having to pass through the other entity (e.g., HU and AU voice signals would be directly between HU and AU devices, HU voice signal would be direct from the HU device to the relay and transcribed text associated with the HU voice would be directly passed from the relay to the AU device to be displayed to the AU). Here, any text generated at the relay to be presented via the HU device would be transmitted directly from the relay to the HU device and any text generated by either one of the AU or HU devices (e.g., via an ASR engine) would be directly transmitted to the receiving device. Thus, an HU device or captioning application run thereby may maintain a direct dial number or address for the relay and be able to link up to the relay automatically when CA or other relay services are required.

Referring now to FIG. 28, a schematic is shown of an exemplary semi-automated captioning system that is consistent with at least some aspects of the present disclosure. The system enables an HU using device 14 to communicate with an AU using AU device 12 where the AU receives text and HU voice signals via the AU device 12. Each of the HU and the AU link into a gateway server or other computing device 900 that is linked via a network of some type to a relay. HU voice signals are fed through a noise reducing audio optimizer to a 3 pole or path ASR switch device 904 that is controlled by an adaptive ASR switch controller 932 to select one of first, second and third text generating processes associated with switch output leads 940, 942 and 944, respectively. The first text generating process is an automated ASR text process wherein an ASR engine generates text without any input (e.g., data entry, correction, etc.) from any CA. The second text generating process is a process wherein a CA 908 revoices an HU voice or types to generate text corresponding to an HU voice signal and then corrects that text. The third text generating process is one wherein the ASR engine generates automated text and a correcting CA 912 makes corrections to the automated text. In the second process, the ASR engine operates in parallel with the CA to generate automated text in parallel to the CA generated and corrected text.

Referring still to FIG. 28, with switch 904 connected to output lead 940, the HU voice signal is only presented to ASR engine 906 which generates automated text corresponding to the HU voice which is then provided to a voice to text synchronizer 910. Here, synchronizer 910 simply passes the raw ASR text on through a correctable text window 916 to the AU device 12.

Referring again to FIG. 28, with switch 904 connected to output lead 942, the HU voice signal, in addition to being linked to the ASR engine, is presented to CA 908 for generating and correcting text via traditional CA voice recognition 920 and manual correction tools 924 via correction window 922. Here, corrected text is provided to the AU device 12 and is also provided to a text comparison unit or module 930. Raw text from the ASR engine 906 is presented to comparison unit 930. Comparison unit 930 compares the two text streams received and calculates an ASR error rate which is output to switch control 932. Here, where the ASR error rate is low (e.g., below some threshold), control 932 may be controlled to cut the text generating CA 908 out of the captioning process.

Referring still to FIG. 28, with switch 904 connected to output lead 944, the HU voice signal, in addition to being linked to the ASR engine, is fed through synchronizer 910 which delays the HU voice signal so that the HU voice signal lags the raw ASR text by a short period (e.g., 2 seconds). The delayed HU voice signal is provided to a CA 912 charged with correcting ASR text generated by engine 906. The CA 912 uses a keyboard or the like 914 to correct any perceived errors in the raw ASR text presented in window 916. The corrected text is provided to the AU device 12 and is also provided to the text comparison unit 930 for comparison to the raw ASR text. Again, comparison unit 930 generates an ASR error rate which is used by control 932 to operate switch device 904. The manual corrections by CA 912 are provided to a CA error tracking unit 918 which counts the number of errors corrected by the CA and compares that number to the total number of words generated by the ASR engine 906 to calculate a CA correction rate for the ASR generated raw text. The correction rate is provided to control 932 which uses that rate to control switch device 904.

Thus, in operation, when an HU-AU call first requires captioning, in at least some cases switch device 904 will be linked to output lead 942 so that full CA transcription and correction occurs in parallel with the ASR engine generating raw ASR text for the HU voice signal. Here, as described above, the ASR engine may be programmed to compare the raw ASR text and the CA generated text and to train to the HU's voice signal so that, over a relatively short period, the error rate generated by comparison unit 930 drops. Eventually, once the error rate drops below some rate threshold, control 932 controls switch device 904 to link to output lead 944 so that CA 908 is taken out of the captioning path and CA 912 is added. CA 912 receives the raw ASR text and corrects that text which is sent on to the AU device 12. As the CA corrects text, the ASR engine continues to train to the HU voice using the corrected errors. Eventually, the ASR accuracy should improve to the point where the correction rate calculated by tracking unit 918 is below some threshold. Once the correction rate is below the threshold, control 932 may control switch 904 to link to output lead 940 to take the CA 912 out of the captioning loop which causes the relatively accurate raw ASR text to be fed through to the AU device 12. As described above in at least some cases the AU and perhaps a CA or the HU may be able to manually switch between captioning processes to meet preferences or to address perceived captioning problems.
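The switching sequence just described (lead 942, then lead 944, then lead 940) may be sketched as a small state function. This is only an illustration under stated assumptions: the names `Lead`, `correction_rate` and `next_lead` are hypothetical, and the threshold values of 0.15 and 0.05 are placeholders, since the disclosure only says "some threshold".

```python
from enum import Enum


class Lead(Enum):
    ASR_ONLY = 940             # raw ASR text passed through to the AU device
    CA_FULL = 942              # CA 908 transcribes and corrects, ASR in parallel
    ASR_PLUS_CORRECTION = 944  # CA 912 corrects raw ASR text


def correction_rate(errors_corrected, asr_words):
    """CA correction rate: corrected errors over total ASR generated words."""
    return errors_corrected / asr_words if asr_words else 0.0


def next_lead(current, asr_error_rate=None, ca_correction_rate=None,
              error_threshold=0.15, correction_threshold=0.05):
    """Return the output lead the controller would select next.

    Thresholds are assumed values chosen only for illustration.
    """
    if (current is Lead.CA_FULL and asr_error_rate is not None
            and asr_error_rate < error_threshold):
        return Lead.ASR_PLUS_CORRECTION
    if (current is Lead.ASR_PLUS_CORRECTION and ca_correction_rate is not None
            and ca_correction_rate < correction_threshold):
        return Lead.ASR_ONLY
    return current


state = next_lead(Lead.CA_FULL, asr_error_rate=0.10)
state = next_lead(state, ca_correction_rate=correction_rate(2, 100))
```

Manual switching by the AU, a CA or the HU is left out of the sketch; it could be modeled as an override applied after `next_lead`.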

As described above, it has been recognized that at least some ASR engines are more accurate and more resilient during the first 30 +/− seconds of performing voice to text transcription. If an HU takes a speaking turn that is longer than 30 seconds the engine has a tendency to freeze or lag. To deal with this issue, in at least some embodiments, all of an HU's speech or voice signal may be fed into an audio buffer and a system processor may examine the HU voice signal to identify any silent periods that exceed some threshold duration (e.g., 2 seconds). Here, a silent period would be detected whenever the HU voice signal audio is out of a range associated with a typical human voice. When a silent period is identified, in at least some cases the ASR engine is restarted and a new ASR session is created. Here, because the process uses an audio buffer, no portion of the HU's speech or voice signal is lost and the system can simply restart the ASR engine after the identified silent period and continue the captioning process after removing the silent period.

Because the ASR engine is restarted whenever a silent period of at least a threshold duration occurs, the system can be designed to have several advantageous features. First, the system can implement a dynamic and configurable range of silence or gap threshold. For instance, in some cases, the system processor monitoring for a silent period of a certain threshold duration can initially seek a period that exceeds some optimal relatively long length and can reduce the length of the threshold duration as the ASR captioning process nears a maximum period prior to restarting the engine. Thus, for instance, where a maximum ASR engine captioning period is 30 seconds, initially the silent period threshold duration may be 3 seconds. However, after an initial 20 seconds of captioning by an engine, the duration may be reduced to 1.5 seconds. Similarly, after 25 seconds of engine captioning, the threshold duration may be reduced further to one half a second.
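The shrinking threshold in the example above can be written as a short step function. This is a sketch using the 3, 1.5 and 0.5 second values and the 20 and 25 second breakpoints from the example; the function names are hypothetical and a deployed system would presumably make these values configurable.

```python
def silence_threshold(elapsed_s):
    """Silent period threshold (seconds) after elapsed_s seconds of captioning."""
    if elapsed_s >= 25:
        return 0.5
    if elapsed_s >= 20:
        return 1.5
    return 3.0


def should_restart(elapsed_s, silence_s):
    """True when the observed silence exceeds the current threshold."""
    return silence_s > silence_threshold(elapsed_s)


early = should_restart(elapsed_s=10, silence_s=2.0)  # threshold is still 3.0
late = should_restart(elapsed_s=22, silence_s=2.0)   # threshold has dropped to 1.5
```

The same 2 second pause is ignored early in a session and accepted as a restart point later in the session.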

As another instance, because the system uses an audio buffer in this case, the system can “manufacture” a gap or silent period in which to restart an ASR engine, holding an HU's voice signal in the audio buffer until the ASR engine starts captioning anew. While the manufactured silent period is not as desirable as identifying a natural gap or silent period as described above, the manufactured gap is a viable option if necessary so that the ASR engine can be restarted without loss of HU voice signal.

In some cases it is contemplated that a hybrid silent period approach may be implemented. Here, for instance, a system processor may monitor for a silent period that exceeds 3 seconds in which to restart an ASR engine. If the processor does not identify a suitable 3-plus second period for restarting the engine within 25 seconds, the processor may wait until the end of any word and manufacture a 3 second period in which to restart the engine.
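The hybrid approach can be sketched as a three-way decision. The function name `restart_decision`, its return labels and the `at_word_end` flag are assumptions made for illustration; how word boundaries are detected is not specified here.

```python
def restart_decision(elapsed_s, silence_s, at_word_end,
                     gap_s=3.0, deadline_s=25.0):
    """Decide how, if at all, to restart the ASR engine.

    Returns "natural" when a silent period longer than gap_s is found,
    "manufactured" when the deadline has passed and a word has just ended
    (the buffer then holds the HU voice signal for gap_s seconds),
    and "continue" otherwise.
    """
    if silence_s > gap_s:
        return "natural"
    if elapsed_s >= deadline_s and at_word_end:
        return "manufactured"
    return "continue"
```

For example, `restart_decision(26.0, 0.2, True)` yields a manufactured gap, while the same call with `at_word_end=False` keeps captioning until the current word ends.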

Where a silent period longer than the threshold duration occurs and the ASR engine is restarted, if the engine is ready for captioning prior to the end of the threshold duration, the processor can take out the end of the silent period and begin feeding the HU voice signal to the ASR engine prior to the end of the threshold period. In this way, the processor can effectively eliminate most of the silent period so that captioning proceeds quickly.

Restarting an ASR engine at various points within an HU voice signal has the additional benefit of making all hypothesis words (e.g., initially identified words prior to contextual correction based on subsequent words) firm in at least some embodiments. Doing so allows a CA correcting the text to make corrections or any other manipulations deemed appropriate for an AU immediately without having to wait for automated contextual corrections and avoids a case where a CA error correction may be replaced subsequently by an ASR engine correction.

In still other cases other hybrid systems are contemplated where a processor examines an HU voice signal for suitably long silent periods in which to restart an ASR engine and, where no such period occurs by a certain point in a captioning process, the processor commences another ASR engine captioning process which overlaps the first process so that no HU voice signal is lost. Here, the processor would work out which captioned words are ultimately used as final ASR output during the overlapping periods to avoid duplicative or repeated text.

Return on Audio Detector Feature

One other feature that may be implemented in some embodiments of this disclosure is referred to as a Return On Audio detector (ROA-Detector) feature. In this regard, a system processor receiving an HU voice signal ascertains whether or not the signal includes audio in a range that is typical for human speech during an HU turn and generates a duration of speech value equal to the number of seconds of speech received. Thus, for instance, in a ten second period corresponding to an HU voice signal turn, there may be 3 seconds of silence during which audio is not in the range of typical human speech and therefore the duration of speech value would be 7 seconds. In addition, the processor detects the quantity of captions being generated by an ASR engine. The processor automatically compares the quantity of captions from the ASR with the duration of speech value to ascertain if there is a problem with the ASR engine. Thus, for instance, if the quantity of ASR generated captions is substantially less than would be expected given the duration of speech value, a potential ASR problem may be identified. The idea here is that if the duration of speech value is low (e.g., 4 out of 10 seconds) while the caption quality value (based on CA error corrections or some other factor(s)) is also low, the low caption quality value is likely not associated with the quantity of speech signal to be captioned and instead is likely associated with an ASR problem. Where an ASR problem is likely, the likely problem may be used by the processor to trigger a restart of the ASR engine to generate a better result. As an alternative, where an ASR problem is likely, the problem may trigger initiation of a whole new ASR session. As still one other alternative, a likely ASR problem may trigger a process to bring a CA on line immediately or more quickly than would otherwise be the case.
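A minimal sketch of the quantity comparison performed by the ROA detector follows. It assumes the audio has already been classified into fixed-length frames flagged as in or out of the typical speech range; the expected speaking rate of 2 words per second and the 0.5 ratio are illustrative assumptions, not values from this disclosure.

```python
def duration_of_speech(frames_in_speech_range, frame_s=1.0):
    """Seconds of audio that fall within the typical human speech range."""
    return sum(1 for in_range in frames_in_speech_range if in_range) * frame_s


def asr_problem_likely(speech_s, caption_words,
                       expected_wps=2.0, min_ratio=0.5):
    """Flag a likely ASR problem when far fewer words are captioned than
    the duration of speech value would suggest (assumed rate and ratio)."""
    expected_words = speech_s * expected_wps
    if expected_words == 0:
        return False  # no speech received, so nothing was expected
    return caption_words < min_ratio * expected_words


# Ten second turn with 3 seconds of silence, as in the example above.
frames = [True] * 7 + [False] * 3
speech_s = duration_of_speech(frames)
flag = asr_problem_likely(speech_s, caption_words=3)
```

A flagged result could then drive any of the responses listed above, such as restarting the engine, starting a new session or bringing a CA on line.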

In still other cases, when a likely ASR error is detected as indicated above, the ROA detector may retrieve the audio (i.e., the HU voice signal) that was originally sent to the ASR from a rolling buffer and replay/resend the audio to the ASR engine. This replayed audio would be sent through a separate session simultaneously with any new sessions that are sending ongoing audio to the ASR. Here, the captions corresponding to the replayed audio would be sent to the AU device and inserted into a correct sequential slot in the captions presented to the AU. In addition, here, the ROA detector would monitor the text that comes back from the ASR and compare that text to the text retrieved during the prior session, modifying the captions to remove redundancies. Another option would be for the ROA to simply deliver a message to the AU device indicating that there was an error and that a segment of audio was likely not properly captioned. Here, the AU device would present the likely erroneous captions in some way that indicates a likely error (e.g., perhaps visually distinguished by a yellow highlight or the like).

In some cases it is contemplated that a phone user may want to have just in time (JIT) captions on their phone or other communication device (e.g., a tablet) during a call with an HU for some reason. For instance, when a smart phone user wants to remove a smart phone from her ear for a short period the user may want to have text corresponding to an HU's voice presented during that period. Here, it is contemplated that a virtual “Text” or “Caption” button may be presented on the smart phone display screen or a mechanical button may be presented on the device which, when selected, causes an ASR to generate text for a preset period of time (e.g., 10 seconds) or until turned off by the device user. Here, the ASR may be on the smart phone device itself, may be at a relay or at some other device (e.g., the HU's device). In other cases where a smart phone includes a motion sensor device or other sensor that can detect when a user moves the device away from her ear or when the user looks at the device (e.g., a face recognition or eye gaze sensor), the system may automatically present text to the AU upon a specific motion (e.g., pulling away from the user's ear) or upon recognizing that the user is likely looking at a display screen on the AU's device.

While HU voice profiles may be developed and stored for any HU calling an AU, in some embodiments, profiles may only be stored for a small set of HUs, such as, for instance, a set of favorites or contacts of an AU. For instance, where an AU has a list of ten favorites, HU voice profiles may be developed, maintained, and morphed over time for each of those favorites. Here, again, the profiles may be stored at different locations and by different devices including the AU device, a relay, via a third party service provider, or even an HU device where the HU earmarks certain AUs as having the HU as a favorite or a contact.

In some cases it may be difficult technologically for a CA to correct ASR captions. Here, instead of a CA correcting captions, another option would simply be for a CA to mark errors in ASR text as wrong and move along. Here, the error could be indicated to an AU via the display on an AU's device. In addition, the error could be used to train an HU voice profile and/or captioning model as described above. As another alternative, where a CA marks a word wrong, a correction engine may generate and present a list of alternative words for the CA to choose from. Here, using an on screen tool, the CA may select a correct word option causing the correction to be presented to an AU as well as causing the ASR to train to the corrected word.

Metrics—Tracking and Reporting CA and ASR Accuracy

In at least some cases it is contemplated that it may be useful to run periodic tests on CA generated text captions to track CA accuracy or reliability over time. For instance, in some cases CA reliability testing can be used to determine when a particular CA could use additional or specialized training. In other cases, CA reliability testing may be useful for determining when to cut a CA out of a call to be replaced by automatic speech recognition (ASR) generated text. In this regard, for instance, if a CA is less reliable than an ASR application for at least some threshold period of time, a system processor may automatically cut the CA out even if ASR quality remains below some threshold target quality level if the ASR quality is persistently above the quality of CA generated text. As another instance, where CA quality is low, text from the CA may be fed to a second CA for either a first or second round of corrections prior to transmission to an AU device for display or, a second relatively more skilled CA trained in handling difficult HU voice signals may be swapped into the transcription process in order to increase the quality level of the transcribed text. As still one other instance, CA reliability testing may be useful to a governing agency interested in tracking CA accuracy for some reason.

In at least some cases it has been recognized that in addition to assessing CA captioning quality, it will be useful to assess how accurately an automated speech recognition system can caption the same HU voice signal regardless of whether or not the quality values are used to switch the method of captioning. For instance, in at least some cases line noise or other signal parameters may affect the quality of HU voice signal received at a relay and therefore, a low CA captioning quality may be at least in part attributed to line noise and other signal processing issues. In this case, an ASR quality value for ASR generated text corresponding to the HU voice signal may be used as an indication of other parameters that affect CA captioning quality and therefore in part as a reason or justification for a low CA quality value. For instance, where an ASR quality value is 75% out of 100% and a CA quality value is 87% out of 100%, the low ASR quality value may be used to show that, in fact, given the relatively higher CA quality value, the CA value is quite good despite being below a minimum target threshold. Line noise and other parameters may be measured in more direct ways via line sensors at a relay or elsewhere in the system and parameter values indicative of line noise and other characteristics may be stored along with CA quality values to consider when assessing CA caption quality.

Several ways to test CA accuracy and generate accuracy statistics are contemplated by the present disclosure. One system for testing and tracking accuracy may include a system where actual or simulated HU-AU calls are recorded for subsequent testing purposes and where HU turns (e.g., voice signal periods) in each call are transcribed and corrected by a CA to generate a true and highly accurate (e.g., approximately 100% accurate) transcription of the HU turns that is referred to hereinafter as the “truth”. Here, metrics on the HU voice message speed, dynamic duration of speech value, complexity of voice message words, quality of voice message signal, voice message pitch, tone, etc., can all be predetermined and used to assess CA accuracy as well as to identify specific call types with specific characteristics that a CA does best with and others that the assistant has relatively greater difficulty handling.

During testing, without a CA knowing that a test is being performed, the test recording is presented to the CA as a new HU-AU call for captioning and the CA perceives the recording to be a typical HU-AU call. In many cases, a large number of recorded calls may be generated and stored for use by the testing system so that a CA never listens to the same test recording more than once. In some cases a system processor may track which test recordings each CA has been exposed to previously and may ensure that a CA only listens to any test recording once.

As a CA listens to a test recording, the CA transcribes the HU voice signal to text and, in at least some cases, makes corrections to the text. Because the CA generated text corresponds to a recorded voice signal and not a real time signal, the text is not forwarded to an AU device for display. The CA is unaware that the text is not forwarded to the AU device as this exercise is a test. The CA generated text is compared to the truth and a quality value is generated for the CA generated text (hereinafter a “CA quality value”). For instance, the CA quality value may be a percent accuracy representing the percent of HU voice signal words accurately transcribed to text. The CA quality value may also be affected by other factors like speed of the voice message, dynamic duration of speech value, complexity of voice message words, quality of voice message signal, voice message pitch, tone, etc.

In at least some cases different CA quality values may be generated for a single CA where each value is associated with a different subset of voice message and captioning characteristics. For instance, in a simple case, a first CA may have a high caption quality value associated with high pitch voices and a relatively lower caption quality value associated with low pitch voices. The same first CA may have a relatively high caption quality value for high pitched voices where a duration of speech value is relatively low (e.g., less than 50%) when compared to the quality value for a high pitched voice where the duration of speech value is relatively high (e.g., greater than 50%). Many other voice message characteristic subsets for qualifying caption quality values are contemplated.

The multiple caption quality values can be used to identify specific call types with specific characteristics that a CA does best with and others that the assistant has relatively greater difficulty handling. Incoming calls can be routed to CAs that are optimized (e.g., available and highly effective for calls with specific characteristics) to handle those calls. CA caption quality values and associated voice message characteristics are stored in a database for subsequent access.

In addition to generating one or more CA quality values that represent how accurately a CA transcribes voice to text, in at least some cases the system will be programmed to track and record transcription latency that can be used as a second type of quality factor referred to hereinafter as the “CA latency value”. Here, the system may track instantaneous latency and use the instantaneous values to generate average and other statistical latency values. For instance, an average latency over an entire call may be calculated, an average latency over a most recent one minute period may be calculated, a maximum latency during a call, a minimum latency during a call, a latency average taking out the most latent 20% and least latent 20% of a call may be calculated and stored, etc. In some cases where both a CA quality value and CA latency values are generated, the system may combine the quality and latency values according to some algorithm to generate an overall CA service value that reflects the combination of accuracy and latency.
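The latency statistics listed above may be computed from a list of instantaneous latency samples, as in the following sketch. The function name `latency_stats` and the dictionary keys are hypothetical; the 20% trim fraction comes from the example above, and the average over a most recent period is omitted because it would need sample time stamps.

```python
from statistics import mean


def latency_stats(latencies, trim=0.2):
    """Summarize instantaneous latency samples (seconds) for one call."""
    ordered = sorted(latencies)
    k = int(len(ordered) * trim)  # samples dropped from each end
    trimmed = ordered[k:len(ordered) - k] or ordered
    return {
        "average": mean(ordered),
        "max": ordered[-1],
        "min": ordered[0],
        # average after removing the most and least latent 20%
        "trimmed_average": mean(trimmed),
    }


stats = latency_stats([1.0, 2.0, 3.0, 4.0, 10.0])
```

With the five samples shown, one sample is dropped from each end, so the single 10 second outlier no longer dominates the trimmed average.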

CA latency may also be calculated in other ways. For instance, in at least some cases a relay server may be programmed to count the number of words during a period that are received from an ASR service provider (see 1006 in FIG. 30) and to assume that the returned number of words over a minute duration represents the actual words per minute (WPM) spoken by an HU. Here, periods of HU silence may be removed from the period so that the word count more accurately reflects WPM of the speaking HU. Then, the number of words generated by a CA for the same period may be counted and used along with the period duration minus silent periods to determine a CA WPM count. The server may then compare the HU's WPM to the CA WPM count to assess CA delay or latency.
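The WPM comparison can be worked through with a small example. The word counts and durations below are made-up illustration values, and `words_per_minute` is a hypothetical helper name.

```python
def words_per_minute(word_count, period_s, silent_s=0.0):
    """Words per minute over a period with silent time removed."""
    speaking_s = period_s - silent_s
    if speaking_s <= 0:
        raise ValueError("period must contain some non-silent time")
    return word_count * 60.0 / speaking_s


# Assumed counts for one 60 second period containing 12 seconds of HU silence.
hu_wpm = words_per_minute(120, 60, 12)  # ASR word count stands in for the HU
ca_wpm = words_per_minute(96, 60, 12)   # words the CA produced in that period
wpm_gap = hu_wpm - ca_wpm               # a positive gap suggests CA delay
```

Here the HU rate works out to 150 WPM and the CA rate to 120 WPM, so the CA is falling behind by 30 words per minute of speech.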

Where actual calls are used to generate CA metrics, in at least some cases call content is not persistently stored as either voice or text for subsequent access. Instead, in these cases, only audio, caption and correction timing information (e.g., delay durations) is stored for each call. In other cases, in addition to the timing information, call characteristics (e.g., Hispanic voice, HU WPM rate, line signal quality, HU volume, tone, etc.) and/or error types (e.g., visible, invisible, minor, etc.) for each corrected and missed error may be stored.

Where pre-recorded test calls are used to generate CA metrics, in at least some cases in addition to storing the timing, call characteristics and error types for each call, the system may store the complete test call audio record with time stamps, captioning record and corrections record so that a system administrator has the ability to go back and view captioning and correction for an entire call to gain insights related to CA strengths and weaknesses.

In at least some cases the recorded call may also be provided to an ASR to generate automatic text. The ASR generated text may also be compared to the truth and an “ASR quality value” may be generated. The ASR quality value may be stored in a database for subsequent use or may be compared to the CA quality value to assess which quality value is higher or for some other purpose. Here, also, an ASR latency value or ASR latency values (e.g., max, min, average over a call, average over a most recent period, etc.) may be generated as well as an overall ASR service value. Again, the ASR and CA values may be used by a system processor to determine when the ASR generated text should be swapped in for the CA generated text and vice versa.

Referring now to FIG. 29, an exemplary system 1000 for testing and tracking CA and ASR quality and latency values using pre-recorded HU-AU calls is illustrated. System 1000 includes relay components represented by the phantom box at 1001 and a cloud based ASR system 1006 (e.g., a server that is linked to via the internet or some other type of computing network). Two sources of pre-generated information are maintained at the relay including a set of recorded calls at 1002 and a set of verified true transcripts at 1010, one truth or true transcript for each recorded call in the set 1002. Again, the recorded calls may include actual HU-AU calls or may include mock calls that occur between two knowing parties that simulate an actual call.

During testing, a connection is linked from a system server that stores the calls 1002 to a captioning platform as shown at 1004 and one of the recorded calls, hereinafter referred to as a test recording, is transmitted to the captioning platform 1004. The captioning platform 1004 sends the received test recording to two targets including a CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the CA generates CA generated text which is forwarded on to a second comparison engine 1014. The verified truth text transcript at 1010 is provided to each of the first and second comparison engines 1012 and 1014. The first engine 1012 compares the ASR text to the truth and generates an ASR quality value and the second engine 1014 compares the CA generated text to the truth and generates a CA quality value, each of which is provided to a system database 1016 for storage until subsequently required.
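One way a comparison engine such as 1012 or 1014 could score text against the truth is sketched below. The disclosure does not specify the comparison algorithm; this sketch assumes a word-level edit distance (substitutions, insertions and deletions each count as one error) converted to a percent accuracy, and the function names and sample sentences are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Minimum word substitutions, insertions and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                              # deletion
                           cur[j - 1] + 1,                           # insertion
                           prev[j - 1] + (ref_word != hyp_word)))    # substitution
        prev = cur
    return prev[-1]


def quality_value(truth, text):
    """Percent accuracy of text against the truth transcript (assumed metric)."""
    ref, hyp = truth.lower().split(), text.lower().split()
    if not ref:
        return 100.0 if not hyp else 0.0
    errors = word_edit_distance(ref, hyp)
    return max(0.0, 100.0 * (1 - errors / len(ref)))


truth = "i will call you after lunch today"
ca_quality = quality_value(truth, "i will call you after lunch today")
asr_quality = quality_value(truth, "i will call you after launch")
```

The two values could then be written to a store standing in for database 1016 alongside the latency values for the same test recording.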

In addition, in some cases, some component within the system 1000 generates latency values for each of the ASR text and the CA generated text by comparing the times at which words are uttered in the HU voice signal to the times at which the text corresponding thereto is generated. The latency values are represented by clock symbols 1003 and 1005 in FIG. 29. The latency values are stored in the database 1016 along with the associated ASR and CA quality values generated by the comparison engines 1012 and 1014.

Another way to test CA quality contemplated by the present disclosure is to use real time HU-AU calls to generate quality and latency values. In these cases, a first CA may be assigned to an ongoing HU-AU call and may operate in a conventional fashion to generate transcribed text that corresponds to an HU voice signal where the transcribed text is transmitted back to the AU device for display at substantially the same time as the HU voice is broadcast to the AU. Here, the first CA may perform any process to convert the HU voice to text such as, for instance, revoicing the HU voice signal to a processor that runs voice to text software trained to the voice of the HU to generate text and then correcting the text on a display screen prior to sending the text to the AU device for display. In addition, the CA generated text is also provided to a second CA along with the HU voice signal and the second CA listens to the HU voice signal and views the text generated by the first CA and makes corrections to the first CA generated text. Having been corrected a second time, the text generated by the second CA is a substantially error free transcription of the HU voice signal referred to hereinafter as the “truth”. The truth and the first CA generated text are provided to a comparison engine which then generates a “CA quality value” similar to the CA quality value described above with respect to FIG. 29 which is stored for subsequent access in a database.

In addition, as is the case in FIG. 29, in the case of transcribing an ongoing HU-AU call, the HU voice signal may also be provided to a cloud based ASR server or service to generate automated speech recognition text during an ongoing call that can be compared to the truth (e.g., the second CA generated text) to generate an ASR quality value. Here, while conventional ASRs are fast, there will again be some latency in text generation and the system will be able to generate an ASR latency value.

Referring now to FIG. 30, an exemplary system 1020 for testing and tracking CA and ASR quality and latency values using ongoing HU-AU calls is illustrated. Components in the FIG. 30 system 1020 that are similar to the components described above with respect to FIG. 29 are labeled with the same numbers and operate in a similar fashion unless indicated otherwise hereafter. In addition to an HU communication device 1040 and an AU communication device 1042 (e.g., a caption type telephone device), system 1020 includes relay components represented by the phantom box at 1021 and a cloud based ASR system 1006 akin to the cloud based system described above with respect to FIG. 29. Here there is no pre-generated and recorded call or pre-generated truth text as testing is done using an ongoing dynamic call. Instead, a second CA at 1030 corrects text generated by a first CA at 1008 to create a truth (e.g., essentially 100% accurate text). The truth is compared to ASR generated text and the first CA generated text to create quality values to be stored in database 1016.

Referring still to FIG. 30, during testing, as in a conventional relay assisted captioning system, the AU device 1042 transmits an HU voice signal to the captioning platform at 1004. The captioning platform 1004 sends the received HU voice signal to two targets including a first CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the first CA generates CA generated text which is transmitted to at least three different targets. First, the first CA generated text, which may include text corrected by the first CA, is transmitted to the AU device 1042 for display to the AU during the call. Second, the first CA generated text is transmitted to the second comparison engine 1014. Third, the first CA generated text is transmitted to a second CA at 1030. The second CA at 1030 views the CA generated text on a display screen and also listens to the HU voice signal and makes corrections to the first CA generated text where the second CA generated text operates as a truth text or truth. The truth is transmitted to the second comparison engine at 1014 to be compared to the first CA generated text so that a CA quality value can be generated. The CA quality value is stored in database 1016 along with one or more CA latency values.

Referring again to FIG. 30, the truth is also transmitted from the second CA at 1030 to the first comparison engine at 1012 to be compared to the ASR generated text so that an ASR quality value is generated which is also stored along with at least one ASR latency value in the database 1016.

Referring to FIG. 31, another embodiment of a testing relay system is shown at 1050 which is similar to the system 1020 of FIG. 30, albeit where the ASR service 1006 provides an initial text transcription to the second CA at 1052 instead of the second CA receiving the initial text from the first CA. Here, the second CA generates the truth text which is again provided to the two comparison engines at 1012 and 1014 so that ASR and CA quality factors can be generated to be stored in database 1016.

The ASR text generation and quality testing processes are described above as occurring essentially in real time as a first CA generates text for a recorded or ongoing call. Here, real time quality and latency testing may be important where a dynamic triage transcription process is occurring where, for instance, ASR generated text may be swapped in for a cut out CA when ASR generated text achieves some quality threshold or a CA may be swapped in for ASR generated text if the ASR quality value drops below some threshold level. In other cases, however, quality testing may not need to be real time and instead, may be able to be done off line for some purposes. For instance, where quality testing is only used to provide metrics to a government agency, the testing may be done off line.

In this regard, referring again to FIG. 29, in at least some cases where testing cannot be done on the fly as a CA at 1008 generates text, the CA text and the recorded HU voice signal associated therewith may be stored in database 1016 for subsequent access for generating the ASR text at 1006 as well as for comparing the CA generated text and the ASR generated text to the verified truth text from 1010. Similarly, referring again to FIG. 30, where real time quality and latency values are not required, at least the HU portion of a call may be stored in database 1016 for subsequent off line processing by ASR service 1006 and the second CA at 1030 and then for comparisons to the truth at engines 1012 and 1014.

It should be appreciated that currently there are Federal and state regulations that prohibit storage of any parts of voice communications between two or more people without authorization from at least one of those persons. For this reason, in at least some cases it is contemplated that real voice recordings of AU-HU calls may only be used for training purposes after authorization is sought and received. Here, the same recording may be used to train multiple CAs. In other cases, “fake” AU-HU call recordings may be generated and used for training purposes so that regulations and AU and HU privacy concerns cannot be violated. Here, true transcripts of the fake calls can be generated and stored for use in assessing CA caption quality. One advantage of fake call records is that different qualities of HU voice signals can be simulated automatically to see how those affect CA caption accuracy, speed, etc. For instance, a first CA may be much more accurate and faster than a second CA at captioning standard or poor definition or quality voice signals.

One advantage of generating quality and latency values in real time using real HU-AU calls is that there is no need to store calls for subsequent processing. Currently there are regulations in at least some jurisdictions that prohibit storing calls for privacy reasons and therefore off line quality testing cannot be done in these cases.

In at least some embodiments it is contemplated that quality and latency testing may only be performed sporadically and generally randomly so that generated values are sort of an average representation of the overall captioning service. In other cases, while quality and latency testing may be periodic in general, it is contemplated that telltale signs of poor quality during transcription may be used to trigger additional quality and latency testing. For instance, in at least some cases where an AU is receiving ASR generated text and the AU selects an option to link to a CA for correction, the AU request may be used as a trigger to start the quality testing process on text received from that point on (e.g., quality testing will commence and continue for HU voice received as time progresses forward). Similarly, when an AU requests full CA captioning (e.g., revoicing and text correction), quality testing may be performed from that point forward on the CA generated text.

In other cases, it is contemplated that an HU-AU call may be stored during the duration of the call and that, at least initially, no quality testing may occur. Then, if an AU requests CA assistance, in addition to patching a CA into the call to generate higher quality transcription, the system may automatically patch in a second CA that generates truth text as in FIG. 30 for the remainder of the call. In addition or instead, when the AU requests CA assistance, the system may, in addition to patching a CA in to generate better quality text, also cause the recorded HU voice prior to the request to be used by a second CA to generate truth text for comparison to the ASR generated text so that an ASR quality value for the text that caused the AU to request assistance can be generated. Here, the pre-CA assistance ASR quality value may be generated for the entire duration of the call prior to the request or just for a most recent sub-period (e.g., for the prior minute or 30 seconds). Here, in at least some cases, it is contemplated that the system may automatically erase any recorded portion of an HU-AU call immediately after any quality values associated therewith have been calculated. In cases where quality values are only calculated for a most recent period of HU voice signal, recordings prior thereto may be erased on a rolling basis.
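The rolling erasure described above may be sketched, under stated assumptions, as a buffer that retains only the most recent sub-period of HU voice. The class name `RollingRecording`, the 30 second window and the representation of a segment as a start time plus raw bytes are hypothetical choices made for illustration only.

```python
from collections import deque


class RollingRecording:
    """Keeps only HU voice segments that began within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.segments = deque()  # (start_time_seconds, audio_bytes), oldest first

    def append(self, start_time, audio):
        self.segments.append((start_time, audio))
        cutoff = start_time - self.window_seconds
        # Erase, on a rolling basis, anything older than the retained sub-period.
        while self.segments and self.segments[0][0] < cutoff:
            self.segments.popleft()

    def erase(self):
        """Erase everything, e.g., immediately after quality values are calculated."""
        self.segments.clear()


recording = RollingRecording(window_seconds=30)
for start in range(0, 60, 10):          # segments starting at 0, 10, ..., 50 seconds
    recording.append(start, b"\x00")
retained_starts = [start for start, _ in recording.segments]
```

When an AU requests CA assistance, the retained segments would be handed to the second CA for truth generation and `erase` would then be called.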

As another instance, in at least some cases it is contemplated that sensors at a relay may sense line noise or other signal parameters and, whenever the line noise or other parameters meet some threshold level, the system may automatically start quality testing which may persist until the parameters no longer meet the threshold level. Here, there may be hysteresis built into the system so that once a threshold is met, at least some duration of HU voice signal below the threshold is required to halt the testing activities. The parameter value or condition or circumstance that triggered the quality testing would, in this case, be stored along with the quality value and latency information to add context to why the system started quality testing in the specific instance.
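One way the hysteresis described above might operate is sketched below. The threshold value, the release duration and the class name `NoiseTriggeredTesting` are assumptions; the sketch simply starts testing when a sensed parameter meets the threshold and halts testing only after the parameter has stayed below the threshold for a set duration.

```python
class NoiseTriggeredTesting:
    """Starts quality testing at a noise threshold; stops only after a quiet period."""

    def __init__(self, threshold, release_seconds):
        self.threshold = threshold
        self.release_seconds = release_seconds
        self.testing = False
        self.quiet_since = None    # time at which the signal last dropped below threshold
        self.trigger_value = None  # stored to add context to the quality record

    def update(self, t, noise_level):
        """Feed one sensed value at time t (seconds); returns True while testing."""
        if noise_level >= self.threshold:
            self.quiet_since = None
            if not self.testing:
                self.testing = True
                self.trigger_value = noise_level
        elif self.testing:
            if self.quiet_since is None:
                self.quiet_since = t
            if t - self.quiet_since >= self.release_seconds:
                self.testing = False
                self.quiet_since = None
        return self.testing


monitor = NoiseTriggeredTesting(threshold=0.5, release_seconds=3.0)
states = [monitor.update(t, level)
          for t, level in [(0, 0.2), (1, 0.7), (2, 0.3), (4, 0.3), (5, 0.3)]]
```

In the sample sequence, testing begins at the one second mark and persists through the brief quiet period until three full seconds below the threshold have elapsed.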

As one other example, in a case where an AU signals dissatisfaction with a captioning service at the end of a call, quality testing may be performed on at least a portion of the call. To this end, in at least some cases as an HU-AU call progresses, the call may be recorded regardless of whether or not ASR or CA generated text is presented to an AU. Then, at the end of a call, a query may be presented to the AU requesting that the AU rate the AU's satisfaction with the call and captioning on some scale (e.g., a 1 through 10 quality scale with 10 being high). Here, if a satisfaction rating were low (e.g., less than 7) for some reason, the system may automatically use the recorded HU voice or at least a portion thereof to generate a CA quality value in one of the ways described above. For instance, the system may provide the text generated by a first CA or by the ASR and the recorded HU voice signal to a second CA for generating truth and a quality value may be generated using the truth text for storage in the database.

In still other cases where an AU expresses a low satisfaction rating for a captioning service, prior to using a recorded HU voice signal to generate a quality value, the system server may request authorization to use the signal to generate a captioning quality value. For instance, after an AU indicates a 7 (out of 10) or lower on a satisfaction scale, the system may query the AU for authorization to check captioning quality by providing a query on the AU's device display and “Yes” and “No” options. Here, if the yes option is selected, the system would generate the captioning quality value for the call and memorialize that value in the system database 1016. In addition, if the system identifies some likely factor in a low quality assessment, the system may memorialize that factor and present some type of feedback indicating the factor as a likely reason for the low quality value. For instance, if the system determines that the AU-HU link was extremely noisy, that factor may be memorialized and indicated to the AU as a reason for the poor quality captioning service.

As another instance, because it is the HU's voice signal that is recorded (e.g., in some cases the AU voice signal may not be recorded) and used to generate the captioning quality value, authorization to use the recording to generate the quality value may be sought from an HU if the HU is using a device that can receive and issue an authorization request at the end of a call. For instance, in the case of a call where an HU uses a standard telephone, if an AU indicates a low satisfaction rating at the end of a call, the system may transmit an audio recording to the HU requesting authorization to use the HU voice signal to generate the quality value along with instructions to select “1” for yes and “2” for no. In other cases where an HU's device is a smart phone or other computing type device, the request may include text transmitted to the HU device and selectable “Yes” and “No” buttons for authorizing or not.

While an HU-AU call recording may be at least temporarily stored at a relay, in other cases it is contemplated that call recordings may be stored at an AU device or even at an HU device until needed to generate quality values. In this way, an HU or AU may exercise more control, or at least perceive that more control is exercised, over call content. Here, for instance, while a call may be recorded, the recording device may not release recordings unless authorization to do so is received from a device operator (e.g., an HU or an AU). Thus, for instance, if the HU voice signal for a call is stored on an HU device during the call and, at the end of a call, an AU expresses low satisfaction with the captioning service in response to a satisfaction query, the system may query the HU to authorize use of the HU voice to generate captioning quality values. In this case, if the HU authorizes use of the HU voice signal, the recorded HU voice signal would be transmitted to the relay to be used to generate captioning quality values as described above. Thus, the HU or AU device may serve as a sort of software vault for HU voice signal recordings that are only released to the relay after proper authorization is received from the HU or the AU, depending on system requirements.

As generally known in the industry, voice to text software accuracy is higher for software that is trained to the voice of a speaking person. Also known is that software can train to specific voices over short durations. Nevertheless, in most cases it is advantageous if software starts with a voice model trained to a particular voice so that caption accuracy can start immediately upon transcription. Thus, for instance, in FIG. 30, when a specific HU calls an AU to converse, it would be advantageous if the ASR service at 1006 had access to a voice model for the specific HU. One way to do this would be to have the ASR service 1006 store voice models for at least HUs that routinely call an AU (e.g., a top ten HU list for each AU) and, when an HU voice signal is received at the ASR service, the service would identify the HU voice signal either using recognition software that can distinguish one voice from others or via some type of an identifier like the phone number of the HU device used to call the AU. Once the HU voice is identified, the ASR service accesses an HU voice model associated with the HU voice and uses that model to perform automated captioning.

One problem with systems that require an ASR service to store HU voice models is that HUs may prefer to not have their voice models stored by third party ASR service providers or at least to not have the models stored and associated with specific HUs. Another problem may be that regulatory agencies may not allow a third party ASR service provider to maintain HU voice models or at least models that are associated with specific HUs. One solution is that no information useable to associate an HU with a voice model may be stored by an ASR service provider. Here, instead of using an HU identifier like a phone number or other network address associated with an HU's device to identify an HU, an ASR server may be programmed to identify an HU's voice signal from analysis of the voice signal itself in an anonymous way. It is contemplated that voice models may be developed for every HU that calls an AU and may be stored in the cloud by the ASR service provider. Even in cases where there are thousands of stored voice models, an HU's specific model should be quickly identifiable by a processor or server.

Another solution may be for an AU device to store HU voice models for frequent callers where each model is associated with an HU identifier like a phone number or network address associated with a specific HU device. Here, when a call is received at an AU device, the AU device processor may use the number or address associated with the HU device to identify which voice model to associate with the HU device. Then, the AU device may forward the HU voice model to the ASR service provider 1006 to be used temporarily during the call to generate ASR text. Similarly, instead of forwarding an HU voice model to the ASR service provider, the AU device may simply forward an intermediate identification number or other identifier associated with the HU device to the ASR provider and the provider may associate the number with a specific HU voice model stored by the provider to access an appropriate HU voice model to use for text transcription. Here, for instance, where an AU supports ten different HU voice models for the ten most recent HU callers, the models may be associated with numbers 1 through 10 and the AU device may simply forward on one of the intermediate identifiers (e.g., “7”) to the ASR provider 1006 to indicate which one of the ten voice models maintained by the ASR provider for the AU to use with the HU voice transmitted.
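The intermediate identifier approach may be sketched as follows. This is a minimal illustration under stated assumptions: the classes `AUDevice` and `ASRProvider`, the sample phone number and the string standing in for a voice model are all hypothetical. The point illustrated is that the provider receives only a small slot number and never the HU's phone number.

```python
class AUDevice:
    """Keeps a private map from HU phone numbers to small slot numbers (1..max_models)."""

    def __init__(self, max_models=10):
        self.max_models = max_models
        self.slot_by_number = {}

    def slot_for(self, hu_number):
        if hu_number in self.slot_by_number:
            return self.slot_by_number[hu_number]
        if len(self.slot_by_number) >= self.max_models:
            return None  # no free slot; the call proceeds without a stored model
        slot = len(self.slot_by_number) + 1
        self.slot_by_number[hu_number] = slot
        return slot


class ASRProvider:
    """Stores voice models keyed only by an AU identifier and a slot number."""

    def __init__(self):
        self.models = {}

    def store_model(self, au_id, slot, model):
        self.models[(au_id, slot)] = model

    def model_for(self, au_id, slot):
        return self.models.get((au_id, slot))


au_device = AUDevice()
provider = ASRProvider()

slot = au_device.slot_for("555-0100")           # AU device resolves the caller locally
provider.store_model("au-42", slot, "model-A")  # provider never sees the phone number
selected = provider.model_for("au-42", au_device.slot_for("555-0100"))
```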

In other cases an ASR may develop and store voice models for each HU that calls a specific AU in a fashion that correlates those models with the AU's identity. Then when the ASR provider receives a call from an AU caption device, the ASR provider may identify the AU and associated HU voice models and use those models to identify the HU on the call and the model associated therewith.

In still other cases an HU device may maintain one or more HU voice models that can be forwarded on to an ASR provider either through the relay or directly to generate text.

Visible and Invisible Voice to Text Errors

In at least some cases other more complex quality analysis and statistics are contemplated that may be useful in determining better ways to train CAs as well as in assessing CA quality values. For instance, it has been recognized that voice to text errors can generally be split into two different categories referred to herein as “visible” and “invisible” errors. Visible errors are errors that result in text that, upon reading, is clearly erroneous while invisible errors are errors that result in text that, despite the error that occurred, makes sense in context. For instance, where an HU voices the phrase “We are meeting at Joe's restaurant for pizza at 9 PM”, in a text transcription “We are meeting at Joe's rodent for pizza at 9 PM”, the word “rodent” is a “visible” error in the sense that an AU reading the phrase would quickly understand that the word “rodent” makes no sense in context. On the other hand, if the HU's phrase were transcribed as “We are meeting at Joe's room for pizza at 9 PM”, the erroneous word “room” is not contextually wrong and therefore cannot be easily discerned as an error. Where the word “restaurant” is erroneously transcribed as “room”, an AU could easily get a wrong impression and for that reason invisible errors are generally considered worse than visible errors.

In at least some cases it is contemplated that some mechanism for distinguishing visible and invisible text transcription errors may be included in a relay quality testing system. For instance, where 10 errors are made during some sub-period of an HU-AU call, three of the errors may be identified as invisible while 7 are visible. Here, because invisible errors typically have a worse effect on communication effectiveness, statistics that capture relative numbers of invisible to all errors should be useful in assessing CA or ASR quality.

In at least some systems it is contemplated that a relay server may be programmed to automatically identify at least visible errors so that statistics related thereto can be captured. For instance, the server may be able to contextually examine text and identify words or phrases that simply make no sense and may identify each of those nonsensical errors as a visible error. Here, because invisible errors make contextual sense, there is no easy algorithm by which a processor or server can identify invisible errors. For this reason in at least some cases a correcting CA (see 1053 in FIG. 31) may be required to identify invisible errors or, in the alternative, the system may be programmed to automatically use CA corrections to identify invisible errors. In this regard, any time a CA changes a word in a text phrase that initially made sense within the phrase to another word that contextually makes sense in the phrase, the system may recognize that type of correction to have been associated with an invisible error.

In at least some cases it is contemplated that the decision to switch captioning methods may be tied at least in part to the types of errors identified during a call. For instance, assume that a CA is currently generating text corresponding to an HU voice signal and that an ASR is currently training to the HU voice signal but is not currently at a high enough quality threshold to cut out the CA transcription process. Here, there may be one threshold for the CA quality value generally and another for the CA invisible error rate where, if either of the two thresholds is met, the system automatically cuts the CA out. For example, the threshold CA quality value may require 95% accuracy and the CA invisible error rate threshold may be 20% coupled with a 90% overall accuracy requirement. Thus, here, if the invisible error rate amounts to 20% or less of all errors and the overall CA text accuracy is above 90% (e.g., the invisible error rate is less than 2% of all words uttered by the HU), the CA may be cut out of the call and ASR text relied upon for captioning. Other error types are contemplated and a system for distinguishing each of several error types from one another for statistical reporting and for driving the captioning triage process is contemplated.
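The arithmetic behind the exemplary thresholds can be made explicit. If overall accuracy is above 90%, errors amount to less than 10% of all words; if invisible errors are 20% or less of those errors, invisible errors amount to less than 0.20 × 10% = 2% of all words. The following sketch applies the two exemplary thresholds to word and error counts; the function name `cut_out_ca` and its parameter names are hypothetical, and the counts are invented for illustration.

```python
def cut_out_ca(words, errors, invisible_errors,
               quality_threshold=0.95,
               combined_accuracy=0.90,
               max_invisible_share=0.20):
    """Return True when either exemplary threshold described above is met."""
    accuracy = 1.0 - errors / words
    invisible_share = invisible_errors / errors if errors else 0.0
    meets_quality = accuracy >= quality_threshold
    meets_combined = (accuracy > combined_accuracy
                      and invisible_share <= max_invisible_share)
    return meets_quality or meets_combined


# 1000 words, 80 errors (92% accuracy), 16 invisible errors (20% of all errors):
# invisible errors are 1.6% of all words, so the combined threshold is met.
decision_a = cut_out_ca(words=1000, errors=80, invisible_errors=16)

# Same accuracy, but 30 of the 80 errors are invisible: neither threshold is met.
decision_b = cut_out_ca(words=1000, errors=80, invisible_errors=30)

# 96% accuracy meets the general quality threshold regardless of error type.
decision_c = cut_out_ca(words=1000, errors=40, invisible_errors=30)
```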

In at least some cases when to transition from CA generated text to ASR generated text may be a function of not just a straight up comparison of ASR and CA quality values and instead may be related to both quality and relative latency associated with different transcription methods. In addition, when to transition in some cases may be related to a combination of quality values, error types and relative latency as well as to user preferences.

Other triage processes for identifying which HU voice to text method should be used are contemplated. For instance, in at least some embodiments when an ASR service or ASR software at a relay is being used to generate and transmit text to an AU device for display, if an ASR quality value drops below some threshold level, a CA may be patched in to the call in an attempt to increase quality of the transcribed text. Here, the CA may either be a full revoicing and correcting CA, just a correcting CA that starts with the ASR generated text and makes corrections or a first CA that revoices and a second CA that makes corrections. In a case where a correcting CA is brought into a call, in at least some cases the ASR generated text may be provided to the AU device for display at the same time that the ASR generated text is sent to the CA for correction. In that case, corrected text may be transmitted to the AU device for in line correction once generated by the CA. In addition, the system may track quality of the CA corrected text and store a CA quality value in a system database.

In other cases when a CA is brought into a call, text may not be transmitted to the AU device until the CA has corrected that text and then the corrected text may be transmitted.

In some cases, when a CA is linked to a call because the ASR generated text was not of a sufficiently high quality, the CA may simply start correcting text related to HU voice signal received after the CA is linked to the call. In other cases the CA may be presented with text associated with HU voice signal that was transcribed prior to the CA being linked to the call for the CA to make corrections to that text and then the CA may continue to make corrections to the text as subsequent HU voice signal is received.

Thus, as described above, in at least some embodiments an HU's communication device will include a display screen and a processor that drives the display screen to present a quality indication of the captions being presented to an AU. Here, the quality indication may include some accuracy percentage, the actual text being presented to the AU, or some other suitable indication of caption accuracy or an accuracy estimation. In addition, the HU device may present one or more options for upgrading the captioning quality such as, for instance, requesting CA correction of automated text captioning, requesting CA transcription and correction, etc.

Time Stamping Voice and Text

In at least some embodiments described above various HU voice delay concepts have been described where an HU's voice signal broadcast is delayed in order to bring the voice signal broadcast more temporally in line with associated captioned text. Thus, for instance, in a system that requires at least three seconds (and at times more time) to transcribe an HU's voice signal to text for presentation, a system processor may be programmed to introduce a three second delay in HU voice broadcast to an AU to bring the HU voice signal broadcast more into simultaneous alignment with associated text generated by the system. As another instance, in a system where an ASR requires at least two seconds to transcribe an HU's voice signal to text for presentation to a correcting CA, the system processor may be programmed to introduce a two second delay in the HU voice that is broadcast to an AU to bring the HU voice signal broadcast more into temporal alignment with the ASR generated text.

In the above examples, the three and two second delays are simply based on the average minimum voice-to-text delays that occur with a specific voice to text system and therefore, at most times, will only imprecisely align an HU voice signal with corresponding text. For instance, in a case where HU voice broadcast is delayed three seconds, if text transcription is delayed ten seconds, the three second delay would be insufficient to align the broadcast voice signal and text presentation. As another instance, where the HU voice is delayed three seconds, if a text transcription is generated in one second, the three second delay would cause the HU voice to be broadcast two seconds after presentation of the associated text. In other words, in this example, the three second HU voice delay would be too much delay at times and too little at other times and misalignment could cause AU confusion.
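The misalignment in the two instances above reduces to a simple difference between the transcription delay and the fixed voice delay. The short sketch below reproduces that arithmetic; the function name `text_lag_after_voice` is hypothetical.

```python
def text_lag_after_voice(voice_delay, text_delay):
    """Seconds by which text trails the voice broadcast (negative: text leads voice)."""
    return text_delay - voice_delay


# Fixed three second voice delay, ten second transcription: text trails voice by 7 s.
slow_case = text_lag_after_voice(voice_delay=3, text_delay=10)

# Fixed three second voice delay, one second transcription: voice trails text by 2 s.
fast_case = text_lag_after_voice(voice_delay=3, text_delay=1)
```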

In at least some embodiments it is contemplated that a transcription system may assign time stamps to various utterances in an HU's voice signal and those time stamps may also be assigned to text that is then generated from the utterances so that the HU voice and text can be precisely synchronized per user preferences (e.g., precisely aligned in time or, if preferred by an AU, with an HU's voice preceding or delayed with respect to text by the same persistent period) when broadcast and presented to the AU, respectively. While alignment per an AU's preferences may cause an HU voice to be broadcast prior to or after presentation of associated text, hereinafter, unless indicated otherwise, it will be assumed that an AU's preference is that the HU voice and related text be broadcast and presented at substantially the same time (e.g., within 1-2 seconds before or after). It should be recognized that in any embodiment described hereafter where the description refers to aligned or simultaneous voice and text, the same teachings will be applicable to cases where voice and text are purposefully misaligned by a persistent period (e.g., always misaligned by 3 seconds per user preference).

Various systems are contemplated for assigning time stamps to HU voice signals and associated text words and/or phrases. In a first relatively simple case, an AU device that receives an HU voice signal may assign periodic time stamps to sequentially received voice signal segments and store the HU voice signal segments along with associated time stamps. The AU device may also transmit at least an initial time stamp (e.g., corresponding to the beginning of the HU voice signal or the beginning of a first HU voice signal segment during a call) along with the HU voice signal to a relay when captioning is to commence.

In at least some embodiments the relay stores the initial time stamp in association with the beginning instant of the received HU voice signal and continues to store the HU voice signal as it is received. In addition, the relay operates its own timer to generate time stamps for on-going segments of the HU voice signal as the voice signal is received and the relay generated time stamps are stored along with associated HU voice signal segments (e.g., one time stamp for each segment that corresponds to the beginning of the segment). In a case where a relay operates an ASR engine or taps into a fourth party ASR service (e.g., Google Voice, IBM's Watson, etc.) where a CA checks and corrects ASR generated text, the ASR engine generates automated text for HU voice segments in real time as the HU voice signal is received.

A CA computer at the relay simultaneously broadcasts the HU voice segments and presents the ASR generated text to a CA at the relay for correction. Here, the ASR engine speed will fluctuate somewhat based on several factors that are known in the speech recognition art so that it can be assumed that the ASR engine will translate a typical HU voice signal segment to text within anywhere between a fraction of a second (e.g., one tenth of a second) and 10 seconds. Thus, where the CA computer is configured to simultaneously broadcast HU voice and present ASR generated text for CA consideration, in at least some embodiments the relay is programmed to delay the HU voice signal broadcast dynamically for a period within the range of a fraction of a second up to the maximum number of seconds required for the ASR engine to transcribe a voice segment to text. Again, here, a CA may have control over the timing between text presentation and HU voice broadcast and may prefer one or the other of the text and voice to precede the other (e.g., HU voice to precede corresponding text by two seconds or vice versa). In these cases, the preferred delay between voice and text can be persistent and unchanging which results in less CA confusion. Thus, for instance, regardless of delay between an HU's initial utterance and ASR text generation, both the utterance and the associated ASR text can be persistently presented simultaneously in at least some embodiments.

After a CA corrects text errors in the ASR engine generated text, in at least some cases the relay transmits the time stamped text back to the AU caption device for display to the AU. Upon receiving the time stamped text from the relay, the AU device accesses the time stamped HU voice signal stored thereat and associates the text and HU voice signal segments based on similar (e.g., closest in time) or identical time stamps and stores the associated text and HU voice signal until presented and broadcasted to the AU. The AU device then simultaneously (or delayed per user preference) broadcasts the HU voice signal segments and presents the corresponding text to the AU via the AU caption device in at least some embodiments.

A flow chart that is consistent with this simple first case of time stamping text segments is shown in FIG. 32 and will be described next. Referring also to FIG. 33, a system similar to the system described above with respect to FIG. 1 is illustrated where similar elements are labelled with the same numbers used in FIG. 1 and, unless indicated otherwise, operates in a similar fashion. The primary differences between the FIG. 1 system and the system described in FIG. 33 are that each of the AU caption device 12 and the relay 16 includes a memory device that stores, among other things, time stamped voice message segments corresponding to a received HU voice signal and that time stamps are transmitted between AU device 12 and relay server 30 (see 1034 and 1036).

Referring to FIGS. 32 and 33, during a call between an HU using an HU device 14 and an AU using AU device 12, at some point, captioning is required by the AU (e.g., either immediately when the call commences or upon selection of a caption option by the AU) at which point AU device 12 performs several functions. First, after captioning is to commence, at block 1102, the HU voice signal is received by the AU device 12. At block 1104, AU device 12 commences assignment and continues to assign periodic time stamps to the HU voice signal segments received at the AU device. The time stamps include an initial time stamp t0 corresponding to the instant in time when captioning is to commence or some specific instant in time thereafter as well as following time stamps. In addition, at block 1104, AU device 12 commences storing the received HU voice signal along with the assigned time stamps that divide up the HU voice signal into segments in AU device memory 1030.

Referring still to FIGS. 32 and 33, at block 1106, AU device 12 transmits the HU voice signal segments to relay 16 along with the initial time stamp t0 corresponding to the instant captioning was initiated where the initial time stamp is associated with the start of the first HU voice segment transmitted to the relay (see 1034 in FIG. 33). At block 1108, relay 16 stores the initial time stamp t0 along with the first HU voice signal segment in memory 1032, runs its own timer to assign subsequent time stamps to the HU voice signal received and stores the HU voice signal segments and relay generated time stamps in memory 1032. Here, because both the AU device and the relay assign the initial time stamp t0 to the same point within the HU voice signal and each assigns other stamps based on the initial time stamp, all of the AU device and relay time stamps should be aligned assuming that each assigns time stamps at the same periodic intervals (e.g., every second).

In other cases, each of the AU device and relay may assign second and subsequent time stamps having the form (t0+Δt) where Δt is a period of time relative to the initial time stamp t0. Thus, for instance, a second time stamp may be (t0+1 sec), a third time stamp may be (t0+4 sec), etc. In this case, the AU device and relay may assign time stamps that have different periods where the system simply aligns stamped text and voice when required based on closest stamps in time.
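The closest-stamp alignment described above can be illustrated with a minimal Python sketch. The function names, the use of seconds relative to t0, and the tie-breaking rule (the earlier stamp wins) are assumptions made for illustration only and are not taken from the described system.

```python
from bisect import bisect_left


def closest_stamp(voice_stamps, text_stamp):
    """Return the voice time stamp nearest to text_stamp.

    voice_stamps must be sorted ascending and non-empty; all values are
    offsets in seconds from the initial time stamp t0 (an assumption).
    """
    i = bisect_left(voice_stamps, text_stamp)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = voice_stamps[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s - text_stamp))


def align_text_to_voice(voice_stamps, text_segments):
    """Pair each (stamp, text) segment with its closest voice stamp."""
    stamps = sorted(voice_stamps)
    return [(closest_stamp(stamps, ts), text) for ts, text in text_segments]


# AU device stamps at t0+0, t0+1 and t0+4 sec; relay text stamped on its own timer.
aligned = align_text_to_voice(
    [0.0, 1.0, 4.0],
    [(0.1, "Hello"), (1.2, "this is"), (3.6, "the clinic")],
)
```

Here each relay stamped text segment is attached to the AU device voice segment whose stamp is nearest in time, even though the two devices stamped at different periods.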

Continuing, at block 1110, relay 16 runs an ASR engine to generate ASR engine text for each of the stored HU voice signal segments and stores the ASR engine text with the corresponding time stamped HU voice signal segments. At block 1112, relay 16 presents the ASR engine text to a CA for consideration and correction. Here, the ASR engine text is presented via a CA computer display screen 32 while the HU voice segments are simultaneously (e.g., as text is scrolled onto display 32) broadcast to the CA via headset 54. The CA uses display 32 and/or other interface devices to make corrections (see block 1116) to the ASR engine text. Corrections to the text are stored in memory 1032 and the resulting text is transmitted at block 1118 to AU device 12 along with a separate time stamp for each of the text segments (see 1036 in FIG. 33).

Referring yet again to FIGS. 32 and 33, upon receiving the time stamped text, AU device 12 correlates the time stamped text with the HU voice signal segments and associated time stamps in memory 1030 and stores the text with the associated voice segments and related time stamps at block 1120. At block 1122, in some embodiments, AU device 12 simultaneously broadcasts and presents the correlated HU voice signal segments and text segments to the AU via an AU device speaker and the AU device display screen, respectively.

Referring still to FIG. 32, it should be appreciated that the time stamps applied to HU voice signal segments and corresponding text segments enable the system to align voice and text when presented to each of a CA and an AU. In other embodiments it is contemplated that the system may only use time stamps to align voice and text for one or the other of a CA and an AU. Thus, for instance, in FIG. 32, the simultaneous broadcast step at 1112 may be replaced by voice broadcast and text presentation immediately when available and synchronous presentation and broadcast may only be available to the AU at step 1122. In a different system synchronous voice and text may be provided to the CA at step 1112 while HU voice signal and caption text are independently presented to the AU immediately upon reception at steps 1102 and 1122, respectively.

In the FIG. 32 process, the AU device only transmits an initial HU voice signal time stamp to the relay corresponding to the instant when captioning commences. In other cases it is contemplated that AU device 12 may transmit more than one time stamp corresponding to specific points in time to relay 16 that can be used to correct any voice and text segment misalignment that may occur during system processes. Thus, for instance, instead of sending just the initial time stamp, AU device 12 may transmit time stamps along with specific HU voice segments every 5 seconds or every 10 seconds or every 30 seconds, etc., while a call persists, and the relay may simply store each newly received time stamp along with an instant in the stream of HU voice signal received.

In still other cases AU device 12 may transmit enough AU device generated time stamps to relay 16 that the relay does not have to run its own timer to independently generate time stamps for voice and text segments. Here, AU device 12 would still store the time stamped HU voice signal segments as they are received and stamped and would correlate time stamped text received back from the relay 16 in the same fashion so that HU voice segments and associated text can be simultaneously presented to the AU.

A sub-process 1138 that may be substituted for a portion of the process described above with respect to FIG. 32 is shown in FIG. 34, albeit where all AU device time stamps are transmitted to and used by a relay so that the relay does not have to independently generate time stamps for HU voice and text segments. In the modified process, referring also and again to FIG. 32, after AU device 12 assigns periodic time stamps to HU voice signal segments at block 1104, control passes to block 1140 in FIG. 34 where AU device 12 transmits the time stamped HU voice signal segments to relay 16. At block 1142, relay 16 stores the time stamped HU voice signal segments after which control passes back to block 1110 in FIG. 32 where the relay employs an ASR engine to convert the HU voice signal segments to text segments that are stored with the corresponding voice segments and time stamps. The process described above with respect to FIG. 32 continues as described above so that the CA and/or the AU are presented with simultaneous HU voice and text segments.

In other cases it is contemplated that an AU device 12 may not assign any time stamps to the HU voice signal and, instead, the relay or a fourth party ASR service provider may assign all time stamps to voice and text signals to generate the correlated voice and text segments. In this case, after text segments have been generated for each HU voice segment, the relay may transmit both the HU voice signal and the corresponding text back to AU device 12 for presentation.

A process 1146 that is similar to the FIG. 32 process described above is shown in FIG. 35, albeit where the relay generates and assigns all time stamps to the HU voice signals and transmits the correlated time stamps, voice signals and text to the AU device for simultaneous presentation. In the modified process 1146, process steps 1150 through 1154 in FIG. 35 replace process steps 1102 through 1108 in FIG. 32 and process steps 1158 through 1162 in FIG. 35 replace process steps 1118 through 1122 in FIG. 32 while similarly numbered steps 1110 through 1116 are substantially identical between the two processes.

Process 1146 starts at block 1150 in FIG. 35 where AU device 12 receives an HU voice signal from an HU device where the HU voice signal is to be captioned. Without assigning any time stamps to the HU voice signal, AU device 12 links to a relay 16 and transmits the HU voice signal to relay 16 at block 1152. At block 1154, relay 16 uses a timer or clock to generate time stamps for HU voice signal segments after which control passes to block 1110 where relay 16 uses an ASR engine to convert the HU voice signal to text which is stored along with the corresponding HU voice signal segments and related time stamps. At block 1112, relay 16 simultaneously presents ASR text and broadcasts HU voice segments to a CA for correction and the CA views the text and makes corrections at block 1116. After block 1116, relay 16 transmits the time stamped text and HU voice segments to AU device 12 and that information is stored by the AU device as indicated at block 1160. At block 1162, AU device 12 simultaneously broadcasts and presents corresponding HU voice and text segments via the AU device display.

In cases where HU voice signal broadcast is delayed so that the broadcast is aligned with presentation of corresponding transcribed text, delay insertion points will be important in at least some cases or at some times. For instance, an HU may speak for 20 consecutive seconds where the system assigns a time stamp every 2 seconds. In this case, one solution for aligning voice with text would be to wait until the entire 20 second spoken message is transcribed and then broadcast the entire 20 second voice message and present the transcribed text simultaneously. This, however, is a poor solution as it would slow down HU-AU communication appreciably.

Another solution would be to divide up the 20 second voice message into 5 second periods with silent delays therebetween so that the transcription process can routinely catch up. For instance, here, during a first five second period plus a short transcription catch up period (e.g., 2 seconds), the first five seconds of the 20 second HU voice message is transcribed. At the end of the first 7 seconds of HU voice signal, the first five seconds of HU voice signal is broadcast and the corresponding text presented to the AU while the next 5 seconds of HU voice signal is transcribed. Transcription of the second 5 seconds of HU voice signal may take another 7 seconds which would mean that a 2 second delay or silent period would be inserted after the first five seconds of HU voice signal is broadcast to the AU. In other cases the ASR text and HU voice would be sent ASAP when generated or received to deliver to the AU. In this case the 7 seconds described would be to complete the segment as opposed to for getting the first words to the AU for broadcast.

This process of inserting periodic delays into HU voice broadcast and text presentation while transcription catches up continues. Here, while it is possible that the delays at the five second times would be at ideal times between consecutive natural phrases, more often than not, the 5 second point delays would imperfectly divide natural language phrases making it more, not less, difficult to understand the overall HU voice message.

A better solution is to insert delays between natural language phrases when possible. For instance, in the case of the 20 second HU voice signal example above, a first delay may be inserted after a first 3 second natural language phrase, a second delay may be inserted after a second 4 second natural language phrase, a third delay may be inserted after a third 5 second natural language phrase, a fourth delay may be inserted after a fourth 2 second natural language phrase and a fifth delay may be inserted after a fifth 2 second natural language phrase, so that none of the natural language phrases during the voice message are broken up by intervening delays.

Software for identifying natural language phrases or natural breaks in an HU's voice signal may use actual delays between consecutive spoken phrases as one proxy for where to insert a transcription catch up delay. In some cases software may be able to perform word, sentence and/or topic segmentation in order to identify natural language phrases. Other software techniques for dividing voice signals into natural language phrases are contemplated and should be used as appropriate.
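As one illustration of using actual delays between spoken words as a proxy for phrase boundaries, the following Python sketch splits a word stream wherever the silence between consecutive words reaches a threshold. The word timing tuples, the function name and the 0.5 second threshold are hypothetical values chosen for the example and are not specified by the described system.

```python
def split_on_pauses(words, min_pause=0.5):
    """Group (word, start, end) tuples into phrases at pauses of at least min_pause seconds.

    Returns a list of (phrase_start_time, phrase_text) pairs, so that each
    phrase carries a time stamp at its beginning.
    """
    phrases = []
    current = []
    for word, start, end in words:
        # A long enough silence since the previous word closes the current phrase.
        if current and start - current[-1][2] >= min_pause:
            phrases.append(current)
            current = []
        current.append((word, start, end))
    if current:
        phrases.append(current)
    return [(p[0][1], " ".join(w for w, _, _ in p)) for p in phrases]


timed_words = [
    ("I", 0.0, 0.2), ("will", 0.25, 0.5), ("call", 0.55, 0.9),
    ("tomorrow", 1.8, 2.4), ("morning", 2.45, 3.0),
]
phrases = split_on_pauses(timed_words)
```

In this example the 0.9 second silence after "call" is treated as a natural break, so a catch up delay could be inserted there without splitting either phrase.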

Thus, while some systems may assign perfectly periodic time stamps to HU voice signals to divide the signals into segments, in other cases time stamps will be assigned at irregular time intervals that make more sense given the phrases that an HU speaks, how an HU speaks, etc.

Voice Message Replay

Where time stamps are assigned to HU voice and text segments, voice segments can be more accurately selected for replay via selection of associated text. For instance, see FIG. 36 that shows a CA display screen 50 with transcribed text represented at 1200. Here, as text is generated by a relay ASR engine and presented to a CA, consistent with at least some of the systems described above, the CA may select a word or phrase in presented text via touch (represented by hand icon 1202) to replay the HU voice signal associated therewith.

When a word is selected in the presented text several things will happen in at least some contemplated embodiments. First, a current voice broadcast to the CA is halted. Second, the selected word is highlighted (see 1204) or otherwise visually distinguished. Third, when the word is highlighted, the CA computer accesses the HU voice segment associated with the highlighted word and re-broadcasts the voice segment for the CA to re-listen to the selected word. Where time stamps are assigned with short intervening periods, the time stamps should enable relatively precise replay of selected words from the text. In at least some cases, the highlight will remain and the CA may change the highlighted word or phrase via standard text editing tools. For instance, the CA may type replacement text to replace the highlighted word with corrected text. As another instance, the CA may re-voice the broadcast word or phrase so that software trained to the CA's voice can generate replacement text. Here, the software may use the newly uttered word as well as the words that surround the uttered word in a contextual fashion to identify the replacement word.

In some cases a “Resume” or other icon 1210 may be presented proximate the selected word that can be selected via touch to continue the HU voice broadcast and text presentation at the location where the system left off when the CA selected the word for re-broadcast. In other cases, a short time (e.g., ¼ second to 3 seconds) after rebroadcasting a selected word or phrase, the system may automatically revert back to the voice and text broadcast at the location where the system left off when the CA selected the word for re-broadcast.

While not shown, in some cases when a text word is selected, the system will also identify other possible words that may correspond to the voice segment associated with the selected word (e.g., second and third best options for transcription of the HU voice segment associated with the selected word) and those options may be automatically presented for touch selection and replacement via a list of touch selectable icons, one for each option, similar to Resume icon 1210. Here, the options may be presented in a list where the first list entry is the most likely substitute text option, the second entry is the second most likely substitute text option, and so on.

Referring again to FIG. 36, in other cases when a text word is selected on a CA display screen 50, a relay server or the CA's computer may select an HU voice segment that includes the selected word and also other words in an HU voice segment or phrase that includes the selected word for re-broadcast to the CA so that the CA has some audible context in which to consider the selected word. Here, when the phrase length segment is re-broadcast, the full text phrase associated therewith may be highlighted as shown at 1206 in FIG. 36. In some cases, the selected word may be highlighted or otherwise visually distinguished in one way and the phrase length segment that includes the selected word may be highlighted or otherwise visually distinguished in a second way that is discernibly different to the CA so that the CA is not confused as to what was selected (e.g., see different highlighting at 1204 and 1206 in FIG. 36).

In some cases a single touch on a word may cause the CA computer to re-broadcast the single selected word while highlighting the selected word and the associated longer phrase that includes the selected word differently while a double tap on a word may cause the phrase that includes the selected word to be re-broadcast to provide audio context. Where the system divides up an HU voice signal by natural phrases, broadcasting a full phrase that includes a selected word should be particularly useful as the natural language phrase should be associated with a more meaningful context than an arbitrary group of words surrounding the selected word.

Even if the system rebroadcasts a full phrase including a selected word, in at least some cases CA edits will be made only to the selected word as opposed to the full phrase. Thus, for instance, in FIG. 36 where a single word is selected but a phrase including the word is rebroadcast, any CA edit (e.g., text entry or text generated by software in response to a revoiced word or phrase) would only replace the selected word, not the entire phrase.

Upon selection of Resume icon 1210, the highlighting is removed from the selected word and the CA computer restarts simultaneously broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. In some cases, the CA computer may back up a few seconds from the point where the computer left off to restart the broadcast to re-contextualize the voice and text presented to the CA as the CA again begins correcting text errors.

In other cases, instead of requiring a user to select a “Resume” option, the system may, after a short period (e.g., one second after the selected word or associated phrase is re-broadcast), simply revert back to broadcasting the HU voice signal and presenting associated transcribed text at the point where the computer left off when the re-broadcast word was selected. Here, a beep or other audibly distinguishable signal may be generated upon word selection and at the end of a re-broadcast to audibly distinguish the re-broadcast from broadcast HU voice. In other cases any re-broadcast voice signal may be audibly modified in some fashion (e.g., higher pitch or tone, greater volume, etc.) to audibly distinguish the re-broadcast from other HU voice signal broadcast.

To enable a CA to select a phrase that includes more than one word for rebroadcast or for correction, in at least some cases it is contemplated that when a user touches a word presented on the CA display device, that word will immediately be fully highlighted. Then, while still touching the initially selected and highlighted word, the CA can slide her finger left or right to select adjacent words until a complete phrase to be selected is highlighted. Upon removing her finger from the display screen, the highlighted phrase remains highlighted and revoicing or text entry can be used to replace the entire highlighted phrase.

Referring now to FIG. 37, a screen shot akin to the screen shot shown in FIG. 26 is illustrated at 50 that may be presented to an AU via an AU device display, albeit where an AU has selected a word from within transcribed text for re-broadcast. In at least some embodiments, similar to the CA system described above, when an AU selects a word from presented text, the instantaneous HU voice broadcast and text presentation is halted, the selected word is highlighted or otherwise visually distinguished as shown at 1230 and the phrase including the selected word may also be differently visually distinguished as shown at 1231. Beeps or other audible signals may be generated immediately prior to and after re-broadcast of a voice signal segment. When a word is selected, the AU device speaker (e.g., the speaker in associated handset 22) re-broadcasts the HU voice signal that is associated through the assigned time stamp to the selected word. In other cases the AU device will re-broadcast the entire phrase or sub-phrase that includes the selected word to give audio context to the selected word.

Referring again to FIG. 37, when an AU selects a word for rebroadcasting, in at least some cases if that word is still on a CA's display screen when the AU selects the word, that word may be specially highlighted on the CA display to alert or indicate to the CA that the AU had trouble understanding the selected word. To this end, see in FIG. 36 that the word selected in FIG. 37 is highlighted on the exemplary CA display screen at 1201. Here, the CA may read the phrase including the word and either determine that the text is accurate or that a transcription error occurred. Where the text is wrong, the CA may correct the text or may simply ignore the error and continue on with transcription of the continuing HU voice signal.

While the time stamping concept is described above with respect to a system where an ASR initially transcribes an HU voice signal to text and a CA corrects the ASR generated text, the time stamping concept is also advantageously applicable to cases where a CA transcribes an HU voice signal to text and then corrects the transcribed text or where a second CA corrects text transcribed by a first CA. To this end, in at least some cases it is contemplated that an ASR may operate in the background of a CA transcription system to generate and time stamp ASR text (e.g., text generated by an ASR engine) in parallel with the CA generated text. A processor may be programmed to compare the ASR text and CA generated text to identify at least some matching words or phrases and to assign the time stamps associated with the matching ASR generated words or phrases to the matching CA generated text.

It is recognized that the CA text will likely be more accurate than the ASR text most of the time and therefore that there will be differences between the two text strings. However, some if not most of the time the ASR and CA generated texts will match so that many of the time stamps associated with the ASR text can be directly applied to the CA generated text to align the HU voice signal segments with the CA generated text. In some cases it is contemplated that confidence factors may be generated for likely associated ASR and CA generated text and time stamps may only be assigned to CA generated text when a confidence factor is greater than some threshold confidence factor value (e.g., 88/100). In most cases it is expected that confidence factors that exceed the threshold value will occur routinely and with short intervening durations so that a suitable number of reliable time stamps can be generated.

Once time stamps are associated with CA generated text, the stamps may be used to precisely align HU voice signal broadcast and text presentation to an AU or a CA (e.g., in the case of a second “correcting CA”) as described above as well as to support re-broadcast of HU voice signal segments corresponding to selected text by a CA and/or an AU.

A sub-process 1300 that may be substituted for a portion of the FIG. 32 process is shown in FIG. 38, albeit where ASR generated time stamps are applied to CA generated text. Referring also to FIG. 32, steps 1302 through 1310 shown in FIG. 38 are swapped into the FIG. 32 process for steps 1112 through 1118. Referring again to FIG. 32, after an ASR engine generates and stores time stamped text segments for a received HU voice signal segment, control passes to block 1302 in FIG. 38 where the relay broadcasts the HU voice signal to a CA and the CA revoices the HU voice signal to transcription software trained to the CA's voice and the software yields CA generated text.

At block 1304, a relay server or processor compares the ASR text to the CA generated text to identify high confidence “matching” words and/or phrases. Here, the phrase high confidence means that there is a high likelihood (e.g., 95% likely) that an ASR text word or phrase and a CA generated text word or phrase both correspond to the exact same HU voice signal segment. Characteristics analyzed by the comparing processor include multiple word identical or nearly identical strings in compared text, temporally when text appears in each text string relative to other assigned time stamps, easily transcribed words where both an ASR and a CA are highly likely to accurately transcribe words, etc. In some cases time stamps associated with the ASR text are only assigned to the CA generated text when the confidence factor related to the comparison is above some threshold level (e.g., 88/100). Time stamps are assigned at block 1306 in FIG. 38.
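A minimal Python sketch of the comparison at block 1304 follows, using the standard library `difflib.SequenceMatcher` to find runs of identical words in the two text strings. The text above does not define how a confidence factor is computed; as a labeled assumption, this sketch treats a run of at least `min_run` consecutive matching words as a high confidence match. The function name and data shapes are likewise hypothetical.

```python
from difflib import SequenceMatcher


def transfer_stamps(asr_words, ca_words, min_run=2):
    """Copy ASR time stamps onto CA generated words that match with high confidence.

    asr_words: list of (word, time_stamp) pairs produced by the ASR engine.
    ca_words:  list of words produced by the CA (e.g., via revoicing).
    min_run:   assumed stand-in for a confidence threshold; only runs of at
               least this many consecutive matching words transfer stamps.
    Returns a list of (ca_word, time_stamp_or_None) pairs.
    """
    asr_tokens = [w.lower() for w, _ in asr_words]
    ca_tokens = [w.lower() for w in ca_words]
    stamps = [None] * len(ca_words)
    matcher = SequenceMatcher(a=asr_tokens, b=ca_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size < min_run:
            continue  # isolated matches are treated as low confidence
        for k in range(block.size):
            stamps[block.b + k] = asr_words[block.a + k][1]
    return list(zip(ca_words, stamps))


asr = [("please", 0.0), ("call", 0.4), ("me", 0.7),
       ("at", 1.0), ("for", 1.3), ("tomorrow", 1.6)]
ca = ["Please", "call", "me", "at", "four", "tomorrow"]
stamped = transfer_stamps(asr, ca)
```

In this example the four word run "please call me at" receives the ASR time stamps, while the corrected word "four" and the isolated single word match "tomorrow" are left unstamped; a system could interpolate stamps for such words from their stamped neighbors.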

At block 1308, the relay presents the CA generated text to the CA for correction and at block 1310 the relay transmits the time stamped CA generated text segments to the AU device. After block 1310 control passes back to block 1120 in FIG. 32 where the AU device correlates time stamped CA generated text with HU voice signal segments previously stored in the AU device memory and stores the times, text and associated voice segments. At block 1122, the AU device simultaneously broadcasts and presents identically time stamped HU voice and CA generated text to an AU. Again, in some cases, the AU device may have already broadcast the HU voice signal to the AU prior to block 1122. In this case, upon receiving the text, the text may be immediately presented via the AU device display to the AU for consideration. Here, the time stamped HU voice signal and associated text would only be used by the AU device to support synchronized HU voice and text re-play or re-presentation.

In some cases the time stamps assigned to a series of text and voice segments may simply represent relative time stamps as opposed to actual time stamps. For instance, instead of labelling three consecutive HU voice segments with actual times 3:55:45 AM; 3:55:48 AM; 3:55:51 AM . . . , the three segments may be labelled t0, t1, t2, etc., where the labels are repeated after they reach some maximum number (e.g., t20). In this case, for instance, during a 20 second HU voice signal, the 20 second signal may have five consecutive labels t0, t1, t2, t3 and t4 assigned, one every four seconds, to divide the signal into five consecutive segments. The relative time labels can be assigned to HU voice signal segments and also associated with specific transcribed text segments.
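The relative labelling scheme can be sketched in a few lines of Python. The function name and the choice to return each label together with its segment start offset are assumptions for illustration; the wrap-around after t20 follows the example given above.

```python
import math


def relative_labels(duration_s, interval_s, max_label=20):
    """Return (label, start_offset) pairs, one per segment of interval_s seconds.

    Labels run t0, t1, ... up to t<max_label> and then repeat from t0.
    """
    count = math.ceil(duration_s / interval_s)
    return [(f"t{i % (max_label + 1)}", i * interval_s) for i in range(count)]


# A 20 second HU voice signal labelled once every four seconds.
labels = relative_labels(20, 4)
```

For the 20 second example this yields the five labels t0 through t4 at offsets 0, 4, 8, 12 and 16 seconds.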

In at least some cases it is contemplated that the rate of time stamp assignment to an HU voice signal may be dynamic. For instance, if an HU is routinely silent for long periods between intermittent statements, time stamps may only be assigned during periods while the HU is speaking. As another instance, if an HU speaks slowly at times and more rapidly at other times, the number of time stamps assigned to the user's voice signal may increase (e.g., when speech is rapid) and decrease (e.g., when speech is relatively slow) with the rate of user speech. Other factors may affect the rate of time stamps applied to an HU voice signal.

While the systems described above are described as ones where time stamps are assigned to an HU voice signal by either or both of an AU's device and a relay, in other cases it is contemplated that other system devices or processors may assign time stamps to the HU voice signal including a fourth party ASR engine provider (e.g., IBM's Watson, Google Voice, etc.). In still other cases where the HU device is a computer (e.g., a smart phone, a tablet type computing device, a laptop computer), the HU device may assign time stamps to the HU voice signal and transmit them to other system devices that need time stamps. All combinations of system devices assigning new or redundant time stamps to HU voice signals are contemplated.

In any case where time stamps are assigned to voice signals and text segments, words, phrases, etc., the engine(s) assigning the time stamps may generate stamps indicating any of (1) when a word or phrase is voiced in an HU voice signal audio stream (e.g., 16:22 to 16:22:5 corresponds to the word “Now”) and (2) the time at which text is generated by the ASR for a specific word (e.g., “Now” generated at 16:25). Where a CA generates text or corrects text, a processor related to the relay may also generate time stamps indicating when a CA generated word is generated as well as when a correction is generated.

In at least some embodiments it is contemplated that any time a CA falls behind when transcribing an HU voice signal or when correcting an ASR engine generated text stream, the speed of the HU voice signal broadcast may be automatically increased or sped up as one way to help the CA catch up to a current point in an HU-AU call. For instance, in a simple case, any time a CA caption delay (e.g., the delay between an HU voice utterance and CA generation of text or correction of text associated with the utterance) exceeds some threshold (e.g., 12 seconds), the CA interface may automatically double the rate of HU signal broadcast to the CA until the CA catches up with the call.

In at least some cases the rate of broadcast may be dynamic between a nominal value representing the natural speaking speed of the HU and a maximum rate (e.g., increase the natural HU voice speed three times), and the instantaneous rate may be a function of the degree of captioning delay. Thus, for instance, where the captioning delay is only 4 seconds or less, the broadcast rate may be “1” representing the natural speaking speed of the HU, if the delay is between 4 and 8 seconds the broadcast rate may be “2” (e.g., twice the natural speaking speed), and if the delay is greater than 8 seconds, the broadcast rate may be “3” (e.g., three times the natural speaking speed).
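The tiered example above maps directly to a small function. The following Python sketch uses the 4 and 8 second thresholds and the rates 1, 2 and 3 from the example; the function name and the treatment of a delay of exactly 8 seconds as belonging to the middle tier are assumptions.

```python
def broadcast_rate(caption_delay_s):
    """Return the HU voice broadcast rate multiplier for a given captioning delay."""
    if caption_delay_s <= 4:
        return 1.0  # natural speaking speed of the HU
    if caption_delay_s <= 8:
        return 2.0  # twice the natural speaking speed
    return 3.0      # maximum rate in this example


rates = [broadcast_rate(d) for d in (2, 4, 6, 8, 12)]
```

A system using other factors (e.g., HU speaking rate or recent correction counts) could replace the fixed thresholds with a function of those inputs.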

In other cases the dynamic rate may be a function of other factors such as but not limited to the rate at which an HU utters words, perceived clarity in the connection between the HU and AU devices or between the AU device and the relay or between any two components within the system, the number of corrections required by a CA during some sub-call period (e.g., the most recent 30 seconds), statistics related to how accurately a CA can generate text or make text corrections at different speaking rates, some type of set AU preference, some type of HU preference, etc.

In some cases the rate of HU voice broadcast may be based on ASR confidence factors. For instance, where an ASR assigns a high confidence factor to a 15 second portion of HU voice signal and a low confidence factor to the next 10 seconds of the HU voice signal, the HU voice broadcast rate may be set to twice the rate of HU speaking speed during the first 15 second period and then be slowed down to the actual HU speaking speed during the next 10 second period or to some other percentage of the actual HU speaking speed (e.g., 75% or 125%, etc.).

In some cases the HU broadcast rate may be at least in part based on characteristics of an HU's utterances. For instance, where an HU's volume on a specific word is substantially increased or decreased, the word (or phrase including the word) may always be presented at the HU speaking speed (e.g., at the rate uttered by the HU). In other cases, where the volume of one word within a phrase is stressed, the entire phrase may be broadcast at speaking speed so that the full effect of the stressed word can be appreciated. As another instance, where an HU draws out pronunciation of a word such as “Well . . . ” for 3 seconds, the word (or phrase including the word) may be presented at the spoken rate.

In some cases the HU voice broadcast rate may be at least in part based on words spoken by an HU or on content expressed in an HU's spoken words. For instance, simple words that are typically easy to understand including “Yes”, “No”, etc., may be broadcast at a higher rate than complex words such as medical diagnoses, multi-syllable terms, etc.

In cases where the system generates text corresponding to both HU and AU voice signals, in at least some embodiments it is contemplated that during normal operation only text associated with the HU signal may be presented to an AU and that the AU text may only be presented to the AU if the AU goes back in the text record to review the text associated with a prior part of a conversation. For instance, if an AU scrolls back in a conversation 3 minutes to review prior discussion, ASR generated AU voice related text may be presented at that time along with the HU text to provide context for the AU viewing the prior conversation.

In the systems described above, whenever a CA is involved in a caption assisted call, the CA considers an entire HU voice signal and either generates a complete CA generated text transcription of that signal or corrects ASR generated text errors while considering the entire HU voice signal. In other embodiments it is contemplated that where an ASR engine generates confidence factors, the system may only present sub-portions of an HU voice signal to a CA that are associated with relatively low confidence factors for consideration to speed up the error correction process. Here, for instance, where ASR engine confidence factors are high (e.g., above some high factor threshold) for a 20 second portion of an HU voice signal and then are low for the next 10 seconds, a CA may only be presented the ASR generated text and the HU voice signal may not be broadcast to the CA during the first 20 seconds while substantially simultaneous HU voice and text are presented to the CA during the following 10 second period so that the CA is able to correct any errors in the low confidence text. In this example, it is contemplated that the CA would still have the opportunity to select an interface option to hear the HU voice signal corresponding to the first 20 second period or some portion of that period if desired.

In some cases only a portion of HU voice signal corresponding to low confidence ASR engine text may be presented at all times and in other cases, this technique of skipping broadcast of HU voice associated with high confidence text may only be used by the system during threshold catch up periods of operation. For instance, the technique of skipping broadcast of HU voice associated with high confidence text may only kick in when a CA text correction process is delayed from an HU voice signal by 20 or more seconds (e.g., via a threshold period).

In particularly advantageous cases, low confidence text and associated voice may be presented to a CA at normal speaking speed and high confidence text and associated voice may be presented to a CA at an expedited speed (e.g., 3 times normal speaking speed) when a text presentation delay (e.g., the period between the time an HU uttered a word and the time when a text representation of the word is presented to the CA) is less than a maximum latency period, and if the delay exceeds the maximum latency period, high confidence text may be presented in block form (e.g., as opposed to rapid sequential presentation of separate words) without broadcasting the HU voice to expedite the catch up process.
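The following is a minimal sketch of how the presentation rule just described might be expressed. The names `Presentation` and `plan_presentation` are hypothetical, as is the choice to keep low confidence text at normal speed even after the maximum latency period is exceeded (the text does not address that case) and to treat a delay exactly equal to the maximum as not yet exceeding it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Presentation:
    voice_rate: Optional[float]  # None means the HU voice is not broadcast
    block_text: bool             # True means text is shown as a block


def plan_presentation(high_confidence: bool, delay_s: float,
                      max_latency_s: float,
                      fast_rate: float = 3.0) -> Presentation:
    """Choose how a text segment and its voice are presented to a CA."""
    if not high_confidence:
        # Low confidence: normal speaking speed, word by word (assumed
        # to hold regardless of delay).
        return Presentation(voice_rate=1.0, block_text=False)
    if delay_s <= max_latency_s:
        # High confidence while within the latency budget: expedited voice.
        return Presentation(voice_rate=fast_rate, block_text=False)
    # High confidence and the delay exceeds the maximum latency period:
    # show the text in block form and skip the voice broadcast.
    return Presentation(voice_rate=None, block_text=True)
```

For example, `plan_presentation(True, 12.0, 10.0)` would yield a block text presentation with no voice broadcast.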

In cases where a system processor or server determines when to automatically switch or when to suggest a switch from a CA captioning system to an ASR engine captioning system, several factors may be considered including the following:

1. Percent match between ASR generated words and CA generated words over some prior captioning period (e.g., last 30 seconds);
2. How accurately ASR confidence factors reflect corrections made by a CA;
3. Words per minute spoken by an HU and how that affects accuracy;
4. Average delay between ASR and CA generated text over some prior captioning period;
5. An expressed AU preference stored in an AU preferences database accessible by a system processor;
6. Current AU preferences as set during an ongoing call via an on screen or other interface tool;
7. Clarity of received signal or some other proxy for line quality of the link between any two processors or servers within the system;
8. Identity of an HU conversing with an AU; and
9. Characteristics of an HU's voice signal.

Other factors are contemplated.

Handling Automatic and Ongoing ASR Text Corrections

In at least some cases a speech recognition engine will sequentially generate a sequence of captions for a single word or phrase uttered by a speaker. For instance, where an HU speaks a word, an ASR engine may generate a first “estimate” of a text representation of the word based simply on the sound of the individual word and nothing more. Shortly thereafter (e.g., within 1 to 6 seconds), the ASR engine may consider words that surround (e.g., come before and after) the uttered word along with a set of possible text representations of the word to identify a final estimate of a text representation of the uttered word based on context derived from the surrounding words. Similarly, in the case of a CA revoicing an HU voice signal to an ASR engine trained to the CA voice to generate text, multiple iterations of text estimates may occur sequentially until a final text representation is generated.

In at least some cases it is contemplated that every best estimate of a text representation of every word to be transcribed will be transmitted immediately upon generation to an AU device for continually updated presentation to the AU so that the AU has the best HU voice signal transcription that exists at any given time. For instance, in a case where an ASR engine generates at least one intermediate text estimate and a final text representation of a word uttered by an HU and where a CA corrects the final text representation, each of the interim text estimate, the final text representation and the CA corrected text may be presented to the AU where updates to the text are made as in line corrections thereto (e.g., by replacing erroneous text with corrected text directly within the text stream presented) or, in the alternative, corrected text may be presented above or in some spatially associated location with respect to erroneous text.

In cases where an ASR engine generates intermediate and final text representations while a CA is also charged with correcting text errors, if the ASR engine is left to continually make context dependent corrections to text representations, there is the possibility that the ASR engine could change CA generated text and thereby undo an intended and necessary CA correction.

To eliminate the possibility of an ASR modifying CA corrected text, in at least some cases it is contemplated that automatic ASR engine contextual corrections for at least CA corrected text may be disabled immediately after a CA correction is made or even once a CA commences correcting a specific word or phrase. In this case, for instance, when a CA initiates a text correction or completes a correction in text presented on her device display screen, the ASR engine may be programmed to assume that the CA corrected text is accurate from that point forward. In some cases, the ASR engine may be programmed to assume that a CA corrected word is a true transcription of the uttered word which can then be used as true context for ascertaining the text to be associated with other ASR engine generated text words surrounding the true or corrected word. In some cases text words prior to and following the CA corrected word may be corrected by the ASR engine based on the CA corrected word that provides new context or independent of that context in other cases. Hereinafter, unless indicated otherwise, when an ASR engine is disabled from modifying a word in a text phrase, the word will be said to be “firm”.

In still other embodiments it is contemplated that after a CA listens to a word or phrase broadcast to the CA or some short duration of time thereafter, the word or phrase may become firm irrespective of whether or not a CA corrects that word or phrase or another word or phrase subsequent thereto. For instance, in some cases once a specific word is broadcast to a CA for consideration, the word may be designated firm. In this case each broadcast word is made firm immediately upon broadcast of the word and therefore after being broadcast, no word is automatically modified by an ASR engine. Here the idea is that once a CA listens to a broadcast word and views a representation of that word as generated by the ASR engine, either the word is correct or if incorrect, the CA is likely about to correct that word and therefore an ASR correction could be confusing and should be avoided.

As another instance, in some cases where a word forms part of a larger phrase, the word and other words in the phrase may not be designated firm until after either (1) a CA corrects the word or a word in the phrase that is subsequent thereto or (2) the entire phrase has been broadcast to the CA for consideration. Here, the idea is that in many cases a CA will have to listen to an entire phrase in order to assess accuracy of specific transcribed words so firming up phrase words prior to complete broadcast of the entire phrase may be premature.

As yet one other instance, in some cases automatic firm designations may be assigned to each word in an HU voice signal a few seconds (e.g., 3 seconds) after the word is broadcast, a few words (e.g., 5 words) after the word is broadcast, or in some other time related fashion.
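As a simple sketch of the “firm” concept described above, the following shows one way a word might be designated firm and how a later ASR contextual revision could be ignored for firm words. The helper names `is_firm` and `apply_asr_revision` are hypothetical, and combining the time based, word count based and CA correction based criteria with a logical OR is an assumption of this sketch; the text presents them as alternatives.

```python
from typing import List


def is_firm(seconds_since_broadcast: float, words_since_broadcast: int,
            ca_corrected: bool = False,
            firm_after_seconds: float = 3.0,
            firm_after_words: int = 5) -> bool:
    """Decide whether a word should be treated as firm."""
    return (ca_corrected
            or seconds_since_broadcast >= firm_after_seconds
            or words_since_broadcast >= firm_after_words)


def apply_asr_revision(words: List[str], firm_flags: List[bool],
                       index: int, revised_word: str) -> List[str]:
    """Apply an ASR contextual revision unless the target word is firm."""
    updated = list(words)  # never mutate the caller's list
    if not firm_flags[index]:
        updated[index] = revised_word
    return updated


# The CA corrected "cods" to "kids", so that word is firm; the earlier
# word "bing" is not firm and may still be revised by the ASR engine.
text = ["bing", "the", "kids"]
firm = [False, False, True]
text = apply_asr_revision(text, firm, 0, "bring")  # applied
text = apply_asr_revision(text, firm, 2, "cods")   # ignored, word is firm
```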

In at least some cases it is contemplated that if a CA corrects a word or words at one location in presented text, if an ASR subsequently contextually corrects a word or phrase that precedes the CA corrected word or words, the subsequent ASR correction may be highlighted or otherwise visually distinguished so that the CA's attention is called thereto to consider the ASR correction. In at least some cases, when an ASR corrects text prior to a CA text correction, the text that was corrected may be presented in a hovering tag proximate the ASR correction and may be touch selectable by the CA to revert back to the pre-correction text if the CA so chooses. To this end, see the CA interface screen shot 1391 shown in FIG. 43 where ASR generated text is shown at 1393 that is similar to the text presented in FIG. 39, albeit with a few corrections. More specifically, in FIG. 43, it is assumed that a CA corrected the word “cods” to “kids” at 1395 (compare again to FIG. 39) after which an ASR engine corrected the prior word “bing” to “bring”. The prior ASR corrected word is highlighted or distinguished as shown at 1397 and the word that was changed to make the correction is presented in hovering tag 1399. Tag 1399 is touch selectable by the CA to revert back to the prior word if selected.

In other cases where a CA initiates or completes a word correction, the ASR engine may be programmed to disable generating additional estimates or hypotheses for any words uttered by the HU prior to the CA corrected word or within a text segment or phrase that includes the corrected word. Thus, for instance, in some cases, where 30 text words appear on a CA's display screen, if the CA corrects the fifth most recently presented word, that word and the 25 preceding words would be rendered firm and unchangeable via the ASR engine. Here, in some cases the CA would still be free to change any word presented on her display screen at any time. In other cases, once a CA corrects a word, that word and any preceding text words may be firm as to both the CA and the ASR engine.

In at least some embodiments a CA interface may be equipped with some feature that enables a CA to firm up all current text results prior to some point in a caption representation on the CA's and AU's display screens. For instance, in some cases a specific simultaneous keyboard selection like the “Esc” key and an “F1” key while a cursor is at a specific location in a caption representation may cause all text that precedes that point, whether ASR initial, ASR corrected, CA initial or CA corrected, to become firm. As another instance, in at least some cases where a CA's display screen is touch sensitive, a CA may contact the screen at a location associated with a captioned word and may perform some on screen gesture to indicate that words prior thereto should be made firm. For example, the on screen gesture may include a swipe upward, a double tap, or some other gesture reserved for firming up prior captioned text on the screen.

In still other cases one or more interface output signals may be used by a CA to help the CA track the CA's correction efforts. For instance, whenever a CA corrects a word or phrase in caption text, all text prior to and including the correction may be highlighted or otherwise visually distinguished (e.g., text color changed) to indicate the point of the most recent CA text change. Here, the CA could still make changes prior to the most recent change but the color change to indicate the latest change in the text would persist. In still other cases the CA may be able to select specific keys like an “Esc” key and some other key (e.g., “F2”) to change text color prior to the selected point as an indication to the CA that prior text has already been considered. In still other cases it is contemplated that on screen “checked” options may be presented on the CA screen that are selectable to indicate that text prior thereto has been considered and the color should be changed. To this end see FIG. 50 where “Checked” icons (two labelled 1544) are presented after each punctuation mark to separate consecutive sentences in ASR generated text 1540. Here, if one of the checked icons is selected, text prior thereto may be highlighted or otherwise visually distinguished to indicate prior correction consideration.

While not shown, whenever text is firmed up and/or whenever a CA has indicated that text has been considered for correction, in addition to indicating that status on the CA display screen, in at least some cases that status may be indicated in a similar fashion on an AU device display screen.

When a CA firms up specific text, in at least some cases even if the CA is listening to HU voice signal prior to the point at which the text is firmed up, the system may automatically jump the HU voice broadcast point to the firmed up point so that the CA does not hear the intervening HU voice signal. When a voice signal jumps ahead, a warning may be presented to the CA on the CA's display screen confirming the jump ahead. In other cases the CA may still have to listen to the intervening HU voice signal. In still other cases the system may play the intervening HU voice signal at a double, triple or some other multiple of the original speech rate to expedite the process of working through the intervening voice signal.

In at least some cases an AU device may support automatic triggers that cause CA activity to skip forward to a current time. For instance, in an ASR-CA backed up mode, in at least some cases where an AU has at least some hearing capability, it may be assumed that when an AU speaks, the AU is responding to a most recent HU voice signal broadcast and therefore understood the most recent HU voice signal and therefore that the AU's understanding of the conversation is current. Here, assuming the AU has a current understanding, the system may automatically skip CA error correction activities to the current HU voice signal and associated ASR text so that any error correction delay is eliminated. In a similar fashion, in a CA caption mode, if an AU speaks, based on the assumption that the AU has a current understanding of the conversation, the system may automatically skip CA text generation and error correction activities to the current HU voice signal so that any text generation and error correction delay is eliminated. In this case, because there is no ASR text prior to the delay skipping, in parallel with the skipping activity, an ASR engine may automatically generate fill in text for the HU voice signal not already captioned by the CA. Any skipping ahead based on AU speech may also firm up all text presented to the AU prior to that point as well as any fill in text where appropriate.

In cases where an AU's voice signal operates as a catch up trigger, in at least some cases the trigger may require absence of typical words or phrases that are associated with a confused state. For instance, an exemplary phrase that indicates confusion may be “What did you say?” As another instance, an exemplary phrase may be “Can you repeat?” In this case, several predefined words or phrases may be supported by the system and, any time one of those words or phrases is uttered by an AU, the system may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated. In other cases the relay server may apply artificial intelligence to recognize when a word or phrase likely indicates confusion and similarly may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated. If the AU's uttered word or phrase is not associated with confusion, as described above, the CA activities (e.g., error correction or captioning and error correction) are skipped ahead to the current HU voice signal.
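The predefined phrase variant of the catch up trigger described above may be sketched as follows. The function name `should_skip_ahead`, the phrase list contents beyond the two examples given in the text, and the simple lowercase substring matching are assumptions of this sketch; a deployed system might instead rely on a trained classifier as the text suggests.

```python
# Phrases taken from the examples in the text; a real list would be longer.
CONFUSION_PHRASES = ("what did you say", "can you repeat")


def should_skip_ahead(au_utterance: str) -> bool:
    """Return True if AU speech should trigger a skip to the current point.

    AU speech is assumed to signal a current understanding unless it
    contains one of the predefined confusion phrases.
    """
    # Normalize: lowercase, drop punctuation, collapse whitespace.
    kept = "".join(ch for ch in au_utterance.lower()
                   if ch.isalnum() or ch.isspace())
    normalized = " ".join(kept.split())
    if not normalized:
        return False  # no recognizable AU speech, so no trigger
    return not any(phrase in normalized for phrase in CONFUSION_PHRASES)
```

With this sketch, an AU utterance such as “Sure, see you Friday.” would trigger the skip while “What did you say?” would not.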

In some cases there may be restrictions on text corrections that may be made by a CA. For instance, in a simple case where an AU device can only present a maximum of 50 words to an AU at a time, the system may only allow a CA to correct text corresponding to the 50 words most recently uttered by an HU. Here, the idea is that in most cases it will make no sense for a CA to waste time correcting text errors in text prior to the most recently uttered 50 words as an AU will only rarely care to back up in the record to see prior generated and corrected text. Here, the window of text that is correctable may be a function of several factors including font type and size selected by an AU on her device, the type and size of display included in an AU's device, etc. This feature of restricting CA corrections to AU viewable text is effectively a limit on how far behind CA error corrections can lag.

In some cases it is contemplated that a call may start out with full CA error correction so that the CA considers all ASR engine generated text but that, once the error correction latency exceeds some threshold level, the CA may only be able to, or may be encouraged to, correct only low confidence text. For instance, the latency limit may be 10 seconds at which point all ASR text is presented but low confidence text is visually distinguished in some fashion designed to encourage correction. To this end see for instance FIG. 40 where low and high confidence text is presented to a CA in different scrolling columns. In some cases error correction may be limited to the left column low confidence text as illustrated. FIG. 40 is described in more detail hereafter. Where only low confidence text can be corrected, in at least some cases the HU voice signal for the high confidence text may not be broadcast.

As another example, see FIG. 40A where a CA display screen shot 1351 includes the same text 1353 as in FIG. 40 presented in a scrolling fashion and where phrases (only one shown) that include one or some threshold of low confidence factor words are visually distinguished (e.g., via a field border 1355, via highlighting, via different text font characteristics, etc.) to indicate the low confidence factor words and phrases. Here, in some cases the system may only broadcast the low confidence phrases skipping from one to the next to expedite the error correction process. In other cases the system may increase the HU voice signal broadcast rate (e.g., 2×, 3×, etc.) between low confidence phrases and slow the rate down to a normal rate during low confidence phrases so that the CA continues to be able to consider low confidence phrases in full context.

In some cases, only low confidence factor text and associated HU voice signal may be presented and broadcast to a CA for consideration with some indication of missing text and voice between the presented text words or phrases. For instance, turn piping representations (see again 216 in FIG. 17) may be presented to a CA between low confidence editable text phrases.

In other cases, while interim and final ASR engine text may be presented to an AU, a CA may only see final ASR engine text and therefore only be able to edit that text. Here, the idea is that most of the time ASR engine corrections will be accurate and therefore, by delaying CA viewing until final ASR engine text is generated, the number of required CA corrections will be reduced appreciably. It is expected that this solution will become more advantageous as ASR engine speed increases so that there is minimal delay between interim and final ASR engine text representations.

In still other cases it is contemplated that only final ASR engine text may be sent on to an AU for consideration. In this case, for instance, ASR generated text may be transmitted to an AU device in blocks where context afforded by surrounding words has already been used to refine text hypotheses. For instance, words may be sent in five word text blocks where the block sent always includes the 6th through 10th most recently transcribed words so that the most recent through fifth most recent words can be used contextually to generate final text hypotheses for the 6th through 10th most recent words. Here, CA text corrections would still be made at a relay and transmitted to the AU device for in line corrections of the ASR engine final text.
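The five word block example above may be sketched as a small generator. The name `final_text_blocks` and its parameters are hypothetical, and the sketch assumes that a word's text is final once five later words are available as context. Words still held back when the call ends are not flushed here; how they are handled is left open.

```python
from typing import Iterable, Iterator, List


def final_text_blocks(words: Iterable[str], block_size: int = 5,
                      context_size: int = 5) -> Iterator[List[str]]:
    """Yield blocks of words once enough trailing context has arrived.

    With the defaults, each yielded block holds the 6th through 10th
    most recently transcribed words; the 5 most recent words are held
    back so they can serve as context.
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) >= block_size + context_size:
            yield buffer[:block_size]  # oldest block is now final
            del buffer[:block_size]    # keep only the context words


# Fifteen transcribed words produce two five word blocks; the last five
# words remain buffered as context.
stream = [f"w{i}" for i in range(1, 16)]
blocks = list(final_text_blocks(stream))
```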

In this case, if a CA takes over the task of text generation from an ASR engine for some reason (e.g., an AU requests CA help), the system may switch over to transmitting CA generated text word by word as the text is generated. In this case CA corrections would again be transmitted separately to the AU device for in line correction. Here, the idea is that the CA generated text should be relatively more accurate than the ASR engine generated text and therefore immediate transmission of the CA generated text to the AU would result in a lower error presentation to the AU.

While not shown, in at least some embodiments it is contemplated that turn piping type indications may be presented to a CA on her interface display as a representation of the delay between the CA text generation or correction and the ASR engine generated text. To this end, see the exemplary turn piping 216 in FIG. 17. A similar representation may be presented to a CA.

Where CA corrections or even CA generated text is substantially delayed, in at least some cases the system may automatically force a split to cause an ASR engine to catch up to a current time in a call and to firm up (e.g., disable a CA from changing the text) text before the split time. In addition, the system may identify a preferred split prior to which ASR engine confidence factors are high. For instance, where ASR engine text confidence factors for spoken words prior to the most recent 15 words are high and for the last 15 words are low, the system may automatically suggest or implement a split at the 15th most recent word so that ASR text prior to that word is firmed up and text thereafter is still presented to the CA to be considered and corrected. Here, the CA may reject the split either by selecting a rejection option or by ignoring the suggestion or may accept the suggestion by selecting an accept option or by ignoring the suggestion (e.g., where the split is automatic if not rejected in some period (e.g., 2 seconds)). To this end, see the exemplary CA screen shot in FIG. 39 where ASR generated text is shown at 1332. In this case, the CA is behind in error correction so that the CA computer is currently broadcasting the word “want” as indicated by the “Broadcast” tag 1334 that moves along the ASR generated text string to indicate to the CA where the current broadcast point is located within the overall string. A “High CF-Catch Up” tag 1338 is provided to indicate a point within the overall ASR text string presented prior to which ASR confidence factors are high and, after which ASR confidence factors are relatively lower. Here, it is contemplated that a CA would be able to select tag 1338 to skip to the tagged point within the text. If a CA selects tag 1338, the broadcast may skip to the associated tagged point so that “Broadcast” tag 1334 would be immediately moved to the point tagged by tag 1338 where the HU voice broadcast would recommence. In other cases, selecting high confidence tag 1338 may cause accelerated broadcast of text between tags 1334 and 1338 to expedite catch up.
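One simple way to locate a preferred split of the kind described above is to scan the confidence factors of the words the CA has not yet considered and stop at the first low confidence word. The function name `suggest_split`, the per word confidence list and the 0.8 threshold are assumptions of this sketch only; the disclosure does not specify how the split point is computed.

```python
from typing import Sequence


def suggest_split(confidences: Sequence[float],
                  threshold: float = 0.8) -> int:
    """Return the index of the first word with confidence below threshold.

    Words before the returned index all have high confidence and could
    be firmed up; words from the index onward would still be presented
    to the CA. Returns len(confidences) if every word is high confidence.
    """
    for index, confidence in enumerate(confidences):
        if confidence < threshold:
            return index
    return len(confidences)


# Ten high confidence words followed by fifteen low confidence words:
# the split lands on the 15th most recent word (index 10).
pending = [0.95] * 10 + [0.40] * 15
split_index = suggest_split(pending)
```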

Referring to FIG. 40, another exemplary CA screen shot 1333 that may be presented to show low and high confidence text segments and to enable a CA to skip to low confidence text and associated voice signal is illustrated. Screen shot 1333 divides text into two columns including a low confidence column 1335 and a high confidence column 1337. Low confidence column 1335 includes text segments that have ASR assigned confidence factors that are less than some threshold value while high confidence column 1337 includes text segments that have ASR assigned confidence factors that are greater than the threshold value. Column 1335 is presented on the left half of screen shot 1333 and column 1337 is presented on the right half of shot 1333. The two columns would scroll upward simultaneously as more text is generated. Again, a current broadcast tag 1339 is provided at a current broadcast point in the presented text. Also, a “High CF, Catch Up” tag 1341 is presented at the beginning of a low confidence text segment. Here, again, it is contemplated that a CA may select the high confidence tag 1341 to skip the broadcast forward to the associated point to expedite the error correction process. As shown, in at least some cases, if the CA does not skip ahead by selecting tag 1341, the HU voice broadcast may be at 2× or more the speaking speed so that catch up can be more rapid.

In at least some cases it is contemplated that when a call is received at an AU device or at a relay, a system processor may use the calling number (e.g., the number associated with the calling party or the calling party's device) to identify the least expensive good option for generating text for a specific call. For instance, for a specific first caller, a robust and reliable ASR engine voice model may already exist and therefore be useable to generate automated text without the need for CA involvement most of the time while no model may exist for a second caller that has not previously used the system. In this case, the system may automatically initiate captioning using the ASR engine and first caller voice model for first caller calls and may automatically initiate CA assisted captioning for second caller calls so that a voice model for the second caller can be developed for subsequent use. Where the received call is from an AU and is outgoing to an HU, a similar analysis of the target HU may cause the system to initiate ASR engine captioning or CA assisted captioning.

In some embodiments identity of an AU (e.g., an AU's phone number or other communication address) may also be used to select which of two or more text generation options to use to at least initiate captioning. Thus, some AUs may routinely request CA assistance on all calls while others may prefer all calls to be initiated as ASR engine calls (e.g., for privacy purposes) where CA assistance is only needed upon request for relatively small sub-periods of some calls. Here, AU phone or address numbers may be used to assess optimal captioning type.

In still other cases both a called and a calling number may be used to assess optimal captioning type. Here, in some cases, an AU number or address may trump an HU number or address and the HU number or address may only be used to assess caption type to use initially when the AU has no perceived or expressed preference.
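The precedence just described, in which an AU preference trumps any HU based selection, may be sketched as follows. The function name `initial_caption_mode`, the string labels "ASR" and "CA", and the use of an in-memory dictionary and set in place of a preferences database and voice model store are all assumptions made for illustration.

```python
from typing import Dict, Set


def initial_caption_mode(au_number: str, hu_number: str,
                         au_preferences: Dict[str, str],
                         hu_voice_models: Set[str]) -> str:
    """Pick the captioning type used to initiate a call.

    An expressed AU preference wins. Otherwise ASR captioning is used
    when a voice model already exists for the HU, and CA assisted
    captioning is used so that a model can be developed.
    """
    preference = au_preferences.get(au_number)
    if preference is not None:
        return preference
    return "ASR" if hu_number in hu_voice_models else "CA"


# Hypothetical numbers for illustration only.
prefs = {"555-0100": "CA"}
models = {"555-0200"}
mode = initial_caption_mode("555-0101", "555-0200", prefs, models)
```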

Referring again to FIG. 39, it has been recognized that, in addition to text corresponding to an HU voice signal, an optimal CA interface needs additional information that is related to specific locations within a presented text string. For instance, specific virtual control buttons need to be associated with specific text string locations. For example, see the “High CF-Catch Up” button in FIG. 39. As other examples, a “resume” tag 1233 as in FIG. 36 or a correction word (see FIG. 20) may need to be linked to a specific text location. As another instance, in some cases a “broadcast” tag indicating the word currently being broadcast may have to be linked to a specific text location (see FIG. 39).

In at least some embodiments, a CA interface or even an AU interface will take a form where text lines are separated by at least one blank line that operates as an “additional information” field in which other text location linked information or content can be presented. To this end, see FIG. 39 where additional information fields are collectively labelled 1215. In other embodiments it is contemplated that the additional information fields may also be provided below associated text lines. In still other embodiments, other text fields may be presented as separate in line fields within the text strings (see 1217 in FIG. 40).

Training, Gamification, CA Scoring, CA Profiles

In many industries it has been recognized that if a tedious job can be gamified, employee performance can be increased appreciably as employees work through obstacles to increase personal speed and accuracy scores and, in some cases, to compete with each other. Here, in addition to increased personal performance, an employing entity can develop insights into best work practices that can be rolled out to other employees attempting to better their performance. In addition, where there are clear differences in CA capabilities under different sets of circumstances, CA scoring can be used to develop CA profiles so that when circumstances can be used to distinguish optimal CAs for specific calls, an automated system can distribute incoming calls to optimal CAs for those specific calls or can move calls among CAs mid-call so that the best CA for each call or parts of calls can be employed.

In the present case, various systems are being designed and tested to add gamification, scoring and profile generating aspects to the text captioning and/or correction processes performed by CAs. In this regard, in some cases it has been recognized that if a CA simply operates in parallel with an ASR engine to generate text, a CA may be tempted to simply let the ASR engine generate text without diligent error correction which, obviously, is not optimal for AUs receiving system generated text where caption accuracy is desired and even required to be at high levels.

To avoid CAs shirking their error correction responsibilities and to help CAs increase their skills, in at least some embodiments it is contemplated that a system processor that drives or is associated with a CA interface may introduce periodic and random known errors into ASR generated text that is presented to a CA as test errors. Here, the idea is that a CA should identify the test errors and at least attempt to make corrections thereto. In most cases, while errors are presented to the CA, the errors are not presented to an AU and instead the likely correct ASR engine text is presented to the AU. In some cases the system allows a CA to actually correct the erroneous text without knowing which errors are ASR generated and which are purposefully introduced as part of one of the gamification or scoring processes. Here, by requiring the CA to make the correction, the system can generate metrics on how quickly the CA can identify and correct caption errors.

In other cases, when a CA selects an introduced text error to make a correction, the interface may automatically make the correction upon selection so that the CA does not waste additional time rendering a correction. In some cases, when an introduced error is corrected either by the interface or the CA, a message may be presented to the CA indicating that the error was a purposefully introduced error.

Referring to FIG. 41, a method 1350 that is consistent with at least some aspects of the present disclosure for introducing errors into an ASR text stream for testing CA alertness is illustrated. At block 1352, an ASR engine generates ASR text segments corresponding to an HU voice signal. At block 1354, a relay processor or ASR engine assigns confidence factors to the ASR text and at block 1356, the relay identifies at least one high confidence text segment as a “test” segment. At block 1358, the processor transmits the high confidence test segment to an AU device for display to an AU. At block 1360, the processor identifies an error segment to be swapped into the ASR generated text for the test segment to be presented to the CA. For instance, where a high confidence test segment includes the phrase “John came home on Friday”, the processor may generate an exemplary error segment like “John camp home on Friday”.

Referring still to FIG. 41, at block 1362, the processor presents text with the error segment to the CA as part of an ongoing text stream to consider for error correction. At decision block 1364, the processor monitors for CA selection of words or phrases in the error segment to be corrected. Where the CA does not select the error segment for correction, control passes to block 1372 where the processor stores an indication that the error segment was not identified and control passes back up to block 1352 where the process continues to cycle. In addition, at block 1372, the processor may also store the test segment, the error segment and a voice clip corresponding to the test segment that may later be accessed by the CA or an administrator to confirm the missed error.

Referring again to block 1364 in FIG. 41, if the CA selects the error segment for correction, control passes to block 1366 where the processor automatically replaces the error segment with the test segment so that the CA does not have to correct the error segment. Here the test segment may be highlighted or otherwise visually distinguished so that the CA can see the correction made. In addition, in at least some cases, at block 1368, the processor provides confirmation that the error segment was purposefully introduced and corrected. To this end, see the “Introduced Error—Now Corrected” tag 1331 in FIG. 39 that may be presented after a CA selects an error segment. At block 1370, the processor stores an indication that the error segment was identified by the CA. Again, in some cases, the test segment, error segment and related voice clip may be stored to memorialize the error correction. After block 1370, control passes back up to block 1352 where the process continues to cycle.
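The flow of method 1350 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions and not the disclosed implementation: the names `Segment`, `InjectionLog`, `pick_test_segment`, `make_error_segment` and `record_result`, the 0.9 confidence threshold and the word substitution table are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    text: str
    confidence: float  # ASR confidence factor (block 1354)


@dataclass
class InjectionLog:
    identified: list = field(default_factory=list)  # block 1370
    missed: list = field(default_factory=list)      # block 1372


def pick_test_segment(segments, min_confidence=0.9):
    """Block 1356: first segment at or above an assumed confidence threshold."""
    for seg in segments:
        if seg.confidence >= min_confidence:
            return seg
    return None


def make_error_segment(test_text, substitutions):
    """Block 1360: swap one word using a hypothetical substitution table."""
    words = test_text.split()
    for i, word in enumerate(words):
        if word in substitutions:
            words[i] = substitutions[word]
            return " ".join(words)
    return None  # no known substitution, so no error is introduced


def record_result(log, test_text, error_text, ca_selected):
    """Blocks 1364-1372: log the outcome and return the text the CA now sees."""
    entry = (test_text, error_text)
    if ca_selected:
        log.identified.append(entry)
        return test_text   # block 1366: error segment replaced automatically
    log.missed.append(entry)
    return error_text
```

In this sketch the AU device would always receive `test_text` (block 1358), while only the CA display receives the value returned by `make_error_segment`.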

In some cases errors may only be introduced during periods when the rate of actual ASR engine errors and CA corrections is low. For instance, where a CA is routinely making error corrections during a one minute period, it would make no sense to introduce more text errors as the CA is most likely highly focused during that period and her attention is needed to ensure accurate error correction. In addition, if a CA is substantially delayed in making corrections, the system may again opt to not introduce more errors.

Error introductions may include text additions, text deletions (e.g., removal of text so that the text is actually missing from the transcript) and text substitutions in some embodiments. In at least some cases the error generating processor or CA interface may randomly generate errors of any type and related to any ASR generated text. In other cases, the processor may be programmed to introduce several different types of errors including visible errors (e.g., defined above as errors that are clear errors when placed in context with other words in a text phrase, e.g., the phrase does not make sense when the erroneous text is included), invisible errors (e.g., errors that make sense and are grammatically right in the context of surrounding words), minor errors which are errors that, while including incorrect text, have no bearing on the meaning of an associated phrase (e.g., “the” swapped for “a”) and major errors which are errors that include incorrect text and that change the meaning of an associated phrase (e.g., swapping a 5 PM meeting time for a 3 PM meeting time). In some cases an error may have two designations such as, for instance, visible and major, visible and minor, invisible and major or invisible and minor.

Because at least some ASR engines can understand context, the engines can also be programmed to ascertain when a simple text error affects phrase meaning and can therefore generate and identify different error types to test a CA's correction skills. For instance, in some cases introduced errors may include visible, invisible, minor and major errors and statistics related to correcting each error type may be maintained as well as when a correction results in a different error. For instance, an invisible major error may be presented to a CA and the CA may recognize that error and incorrectly correct it to introduce a visible minor error which, while still wrong, is better than the invisible major error. Here, statistics would reflect that the CA identified and corrected the invisible major error but made an error when correcting which resulted in a visible minor error. As another instance, a visible minor error may be incorrectly corrected to introduce an invisible major error which would generate a much worse captioning result that could have substantial consequences. Here, statistics would reflect that the CA identified and corrected the initial error which is good, but would also reflect that the correction made introduced another error and that the new error resulted in a worse transcription result.
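The two-part error designations and the "better or worse" bookkeeping described above can be sketched as follows. The `ErrorType` class, the `severity` ranking and the outcome labels are hypothetical; in particular, the numeric ordering (major outweighs minor, and invisible is worse than visible within the same class) is an assumption chosen only to be consistent with the two examples in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorType:
    visible: bool  # error is obvious in the context of the phrase
    major: bool    # error changes the meaning of the phrase

    def severity(self) -> int:
        # Assumed ranking: visible minor = 0, invisible minor = 1,
        # visible major = 2, invisible major = 3.
        return 2 * int(self.major) + int(not self.visible)


def correction_outcome(before: ErrorType, after: Optional[ErrorType]) -> str:
    """Classify a CA correction; `after` is None when the result is correct."""
    if after is None:
        return "corrected"
    if after.severity() < before.severity():
        return "improved"
    if after.severity() > before.severity():
        return "worsened"
    return "unchanged"
```

With this ranking, correcting an invisible major error into a visible minor error is recorded as "improved", while the reverse is recorded as "worsened".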

In some embodiments gamification can be enhanced by generating ongoing, real time dynamic scores for CA performance including, for instance, a score associated with accuracy, a separate score associated with captioning speed and/or separate speed and accuracy scores under different circumstances such as, for instance, for male and female voices, for east coast accents, Midwest accents, southern accents, etc., for high speed talking and slower speed talking, for captioning with correcting versus captioning alone versus correcting ASR engine text, and any combinations of factors that can be discerned. In FIG. 40, exemplary accuracy and speed scores that are updated in real time for an ongoing call are shown at 1343 and 1345, respectively. Where a call persists for a long time, a rolling most recent sub-period of the call may be used as a duration over which at least current scores are calculated and separate scores associated with an entire call may be generated and stored as well.

CA scores may be stored as part of a CA profile and that profile may be routinely updated to reflect growing CA effectiveness with experience over time. Once CA specific scores are stored in a CA profile, the system may automatically route future calls that have characteristics that match high scores for a specific CA to that CA which should increase overall system accuracy and speed. Thus, for instance, if an HU profile associated with a specific phone number indicates that an associated HU has a strong southern accent and speaks rapidly, when a call is received that is associated with that phone number, the system may automatically route the call to a CA that has a high gamification score for rapid southern accents if such a CA is available to take the call. In other cases it is contemplated that when a call is received at a relay where the call cannot be associated with an existing HU voice profile, the system may assign the call to a first CA to commence captioning where a relay processor analyzes the HU voice during the beginning of the call and identifies voice characteristics (e.g., rapid, southern, male, etc.) and automatically switches the call to a second CA that is associated with a high gamification score for the specific type of HU voice. In this case, speed and accuracy would be expected to increase after the switch to the second CA.

Similarly, if a call is routed to one CA based on an incoming phone number and it turns out that a different HU voice is present on the call so that a better voice profile fits the HU voice, the call may be switched from an initial CA to a different CA that is more optimal for the HU voice signal. In some cases a CA switch mid-call may only occur if some threshold level of delay or captioning errors is detected. For instance, if a first assigned CA's delay and error rate is greater than threshold values and a system processor recognizes HU voice characteristics that are much better suited to a second available CA's skill set and profile, the system may automatically transition the call from the first CA to the second CA.
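Profile-based routing and the threshold-gated mid-call switch can be sketched as below. All names (`match_score`, `pick_ca`, `should_switch`), the per-trait score dictionaries, the averaging rule and the default threshold values are assumptions made for illustration; the disclosure does not specify how profile scores are combined.

```python
from statistics import mean


def match_score(profile, traits):
    """Average a CA's stored scores over the call's traits (0 if unknown)."""
    if not traits:
        return 0.0
    return mean(profile.get(trait, 0.0) for trait in traits)


def pick_ca(profiles, traits, available):
    """Return the available CA whose profile best matches the HU voice traits."""
    candidates = [ca for ca in available if ca in profiles]
    if not candidates:
        return None
    return max(candidates, key=lambda ca: match_score(profiles[ca], traits))


def should_switch(delay_s, error_rate, current, candidate, profiles, traits,
                  max_delay_s=5.0, max_error_rate=0.05, min_gain=10.0):
    """Switch only when both thresholds are exceeded and the candidate CA
    is a clearly better fit (assumed minimum score gain)."""
    over_threshold = delay_s > max_delay_s and error_rate > max_error_rate
    gain = (match_score(profiles[candidate], traits)
            - match_score(profiles[current], traits))
    return over_threshold and gain >= min_gain
```

For example, with profiles `{"ca1": {"southern": 60, "rapid": 55}, "ca2": {"southern": 90, "rapid": 85}}` and call traits `["southern", "rapid"]`, `pick_ca` selects `"ca2"` when both CAs are available and falls back to `"ca1"` when only that CA is free.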

In addition, in some cases it is contemplated that in addition to the individual speed and accuracy scores, a combined speed/accuracy score can be generated for each CA over the course of time, for each CA over a work period (e.g., a 6 hour captioning day), for each CA for each call that the CA handles, etc. For example, an exemplary single score algorithm may include a running tally that adds one point for a correct word and adds zero points for an incorrect word, where the correct word point is offset by an amount corresponding to a delay in word generation after some minimal threshold period (e.g., 2 seconds after the word is broadcast to the CA for transcription or one second after the word is broadcast to and presented to a CA for correction). For instance, the offset may be 0.2 points for every second after the minimal threshold period. Other algorithms are contemplated. The single score may be presented to a CA dynamically and in real time so that the CA is motivated to focus more. In other cases the single score per phone call may be presented at the end of each call or an average score over a work period may be presented at the end of the work period. In FIG. 40, an exemplary current combined score is shown at 1347.
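The exemplary single score algorithm can be written out directly. The function names `word_points` and `combined_score` are hypothetical, and flooring a late word's contribution at zero is an assumption since the text does not say whether a very late correct word can contribute negative points.

```python
def word_points(correct, delay_s, threshold_s=2.0, penalty_per_s=0.2):
    """One point for a correct word, reduced by 0.2 points for each second
    of delay beyond the threshold; zero points for an incorrect word."""
    if not correct:
        return 0.0
    late_s = max(0.0, delay_s - threshold_s)
    return max(0.0, 1.0 - penalty_per_s * late_s)  # assumed floor at zero


def combined_score(words, threshold_s=2.0):
    """Running tally over (correct, delay_s) pairs for a call or work period.

    Use threshold_s=2.0 for transcription and 1.0 for correction, per the
    example thresholds given above.
    """
    return sum(word_points(c, d, threshold_s) for c, d in words)
```

As a worked case, a correct word produced 4 seconds after broadcast during transcription earns 1 - 0.2 x (4 - 2) = 0.6 points, while the same word produced within 2 seconds earns the full point.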

The single score or any of the contemplated metrics may also be related to other factors such as, for instance:

(1) How quickly errors are corrected by a CA;

(2) How many ASR errors need to be corrected in a rolling period of time;

(3) ASR delays;

(4) How many manufactured or purposefully introduced errors are caught and corrected;

(5) Error types (e.g., visible, invisible, minor and major);

(6) Correct and incorrect corrections;

(7) Effect of incorrect corrections and non-corrections (e.g., better caption or worse caption);

(8) Rates of different types of corrections;

(9) Error density;

(10) Once a CA is behind, how does the CA respond, rate of catchup;

(11) HU speaking rate (WPM);

(12) HU accent or dialect;

(13) HU volume, pitch, tone, changes in audible signal characteristics;

(14) Voice signal clarity (perhaps as measured by the ASR engine);

(15) Communication link quality;

(16) Noise level (e.g., HU operating in high wind environment where noise is substantial and persistent);

(17) Quality of captioned sentence structure (e.g., verb, noun, adverb, in acceptable sequence);

(18) ASR confidence factors associated with text generated during a call (as a proxy for captioning complexity), etc.

In at least some embodiments where gamification and training processes are applied to actual AU-HU calls, there may be restrictions on the ability to store captions of actual conversations. Nevertheless, in these cases, captioning statistics may still be archived without saving caption text and the statistics may be used to drive scoring and gamification routines. For instance, for each call, call characteristics may be stored including, for instance, HU accent, average HU voice signal rate, highest HU voice signal rate, average volume of HU voice signal, other voice signal defining parameters, communication line clarity or other line characteristics, etc. (e.g., any of the other factors listed above). In addition, CA timing information may be stored for each audio segment in the call, for captioned words and for corrective CA activities.

As in the case of the full or pure CA metrics testing and development system described above, in at least some cases real AU-HU calls may be replaced by pre-recorded test call data sets where audio is presented to a CA while mock ASR engine text associated therewith is visually presented to the CA for correction. In at least some cases, the pre-stored test data set may only include a mocked up HU voice signal and known correct or true text associated therewith and the system including an ASR engine may operate in a normal fashion so the ASR engine generates real time text including ASR errors for the mocked up HU voice signal as a CA views that ASR text and makes corrections. Here, as the CA generates corrected final text, a system processor may automatically compare that text to the known correct or true text to generate CA call metrics including various scoring values.

In other cases, the ASR engine functions may be mimicked by a system processor that automatically introduces known errors of specific types into the correct or true text associated with the mocked up HU voice signal to generate mocked up ASR text that is presented to a CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.

In still other cases, in addition to storing the test HU voice signal and associated true text, the system may also store a test version of text associated with the HU voice signal where the test text version has known errors of known types and, during a test session, the test text with errors may be presented to the CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.
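In each of the three test variants above, the common step is the automatic comparison of CA corrected final text against the known true text. One way that comparison might be sketched is with a word-level alignment from Python's standard `difflib` module. The function names `find_errors` and `accuracy`, the error labels and the accuracy formula are assumptions for illustration, not the scoring method of the disclosure.

```python
from difflib import SequenceMatcher

# Map difflib opcodes (edits that turn CA text into truth text) to labels.
_KIND = {"replace": "substitution", "insert": "missing", "delete": "extra"}


def find_errors(ca_text, truth_text):
    """Return word-level differences between CA final text and true text."""
    ca, truth = ca_text.split(), truth_text.split()
    matcher = SequenceMatcher(a=ca, b=truth, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            errors.append({"kind": _KIND[op],
                           "ca": ca[i1:i2],
                           "truth": truth[j1:j2]})
    return errors


def accuracy(ca_text, truth_text):
    """Assumed metric: one minus erroneous words over truth word count."""
    truth_len = len(truth_text.split())
    if truth_len == 0:
        return 1.0
    wrong = sum(max(len(e["ca"]), len(e["truth"]))
                for e in find_errors(ca_text, truth_text))
    return max(0.0, 1.0 - wrong / truth_len)
```

Because only the alignment result and the derived counts are needed for scoring, statistics of this kind can be retained even where the caption text itself is not stored.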

In each case where a mocked up HU voice signal is used during a test session, the voice signal and CA captioned transcripts can be maintained and correlated with the CA's results so that the CA and/or a system administrator can review those results for additional scoring purposes or to identify other insights into a specific CA's strengths and weaknesses or into CA activities more generally.

In at least some cases CAs may be tested using a testing application that, in addition to generating mock ASR text and ASR corrections for a mocked up AU-HU voice call, also simulates other exemplary and common AU actions during the call such as, for instance, switching from an ASR-CA backed up mode to a full CA captioning and error correction mode. Here, as during a normal call, the CA would listen to the HU voice signal and see ASR generated text on her CA display screen and would edit perceived errors in the ASR text during the ASR-CA backed up mode operation. Here, the CA would have full functionality to skip around within the ASR generated text to rebroadcast HU segments during error correction, to firm up ASR text, etc., just as if the mocked up call were real. At some point, the testing application would then issue a command to the CA station indicating that the AU requires full CA captioning and correction without ASR assistance at which point the CA system would switch over to full CA captioning and correction mode. A switch back to the ASR-CA backed up mode may occur subsequently.

Where pre-recorded mock HU voice signals are fed to a CA, a Truth/Scorer processor may be programmed to automatically use known HU voice signal text to evaluate CA corrections for accuracy as described above. Here, a final draft of the CA corrected text may be stored for subsequent viewing and analysis by a system administrator or by the CA to assess effectiveness, timing, etc.

Where scoring is to be applied to a live AU-HU call that does not use a pre-recorded HU voice signal so there is no initial “true” text transcript, a system akin to one of those described above with respect to one of FIG. 30 or 31 may be employed where a “truth” transcript is generated either via another CA or an ASR or a CA correcting ASR generated text for comparison and scoring purposes. Here, the second CA that generates the truth transcript may operate at a much slower pace than the pace required to support an AU as caption rate is not as important and can be sacrificed for accuracy. Once or as the second CA generates the truth transcript, a system processor may compare the first CA captioning results to the truth transcript to identify errors and generate statistics related thereto. Here, the truth transcript is ultimately deleted so that there is no record of the call and all that persists is statistics related to the CA's performance in handling the call.

In other embodiments where scoring is applied to a live AU-HU call that does not have a predetermined “truth” transcript, the second CA may receive the first CA's corrected text and listen to the HU voice signal while correcting the first CA's corrected text a second time. In this case, a processor tracks corrections by the first CA as well as statistics related to one or any subset of the call factors (e.g., rate of speech, number of ASR text errors per some number of words, etc.) listed above. In addition, the processor tracks corrections by the second CA where the second CA corrections are considered the Truth transcript. Thus, any correction made by the second CA is taken as an error in the first CA's corrected text.

In at least some cases, instead of just identifying CA caption errors generally, either a system processor or a second CA/scorer may categorize each error as visible (e.g., in context of phrase, error makes no sense), invisible (e.g., in context of phrase error makes sense but meaning of phrase changes) or minor (e.g., error that does not change the meaning of the including phrase). Where a scoring second CA has to identify error type in a case where a mock AU-HU call is used as the source for CA correction, a processor may present a screenshot to the second CA where all errors are identified as well as tallying tools for adding each error to one of several error type buckets.

To this end, see FIG. 51 where an exemplary CA scoring screen shot 1568 is illustrated. The screen shot 1568 includes a CA text transcript at 1572 that includes corrections by a first CA that is being scored by a CA scorer (e.g., a system manager or administrator). While scoring the text, the scorer listens to the HU voice signal via a headset and, in at least some cases, a word associated with a currently broadcast HU voice signal is highlighted to aid the scorer in following along. In the illustrated embodiment, a system processor compares the CA corrected text to a truth transcript and identifies transcription errors. Each error in FIG. 51 is visually distinguished. For instance, see exemplary field indicators 1574, 1576, 1578 and 1580, each of which represents an error.

Referring still to FIG. 51, as the scorer works her way through the CA text transcript considering each error, the scorer uses judgement to determine if the error is a major error or a minor error and designates each error either major or minor. For instance, a scorer may use a mouse or touch to select each error and then use specific keyboard keys to assign different error types to each error. In the illustrated example, a “V” keyboard selection designates an error as a major error while an “F” selection designates the error as a minor error. In FIG. 51, each time an error type is assigned to an error, a V1 or F1 designator is spatially associated with the error on the screen shot 1568 so that the error type is clear. In addition, when an error type is assigned to an error, the designated error is visually distinguished in a different fashion to help the scorer track which errors have been characterized and which have not. For instance, in FIG. 51, fields 1574 and 1576 are shown as left up to right cross hatched to indicate a red color indicating that associated errors have been categorized while fields 1578 and 1580 are shown left down to right cross hatched to indicate a blue color reserved for errors that have yet to be considered and categorized by the scorer.

In addition, when an error type is assigned to an error, a counter associated with the error type is incremented to indicate a total count for that specific type of error. To this end, a counter field 1570 is presented along the top edge of the screen shot 1568 that includes several counters including a major error counter and a minor error counter at 1598 and 1600, respectively. The final counts are used to generate various metrics related to CA quality and effectiveness.

In at least some cases a scorer may be able to select an error field to access associated text from the truth transcript that is associated with the error. To this end, see in FIG. 51 where hand icon 1594 indicates user selection of error field 1578 which opens up truth text field 1596 in which associated truth text is presented. In the example, the name “Jane” is the truth text for the error “Jam”. Thus, the scorer can either listen to the broadcast voice or view truth text to compare to error text for assessing error type.

Referring still to FIG. 51, missing text is also an error and is represented by the term “% missing” as shown at 1580. Here, again, the scorer can select the missing text field to view truth text associated therewith in at least some embodiments.

A “non-error” is erroneous text that could not possibly be confusing to someone reading a caption. For instance, exemplary non-errors include alternate spellings of a word, punctuation, spelled out numbers instead of numerals, etc. Here, while the system may flag non-errors between a truth text and CA generated text, the scorer may un-flag those errors as they are effectively meaningless. The idea here is that on balance, it is better to have faster captioning with some non-errors than slower captioning where there are no non-errors and therefore, at a minimum, CAs should not be penalized for purposefully or even unintentionally allowing non-errors. When a scorer un-flags a non-error, the appearance of the non-error is changed so that it is not visually distinguished from other correct text in at least some embodiments. In addition, when a scorer un-flags a non-error, a value in a non-error count field 1602 is incremented by one.
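The tallying behavior of the scoring screen (key-driven type assignment, per-type counters and un-flagging of non-errors) can be sketched with a small class. The `ScoreSheet` class, its method names and the string error identifiers are hypothetical; only the “V” for major and “F” for minor key mapping comes from the illustrated example.

```python
from collections import Counter

KEY_TO_TYPE = {"V": "major", "F": "minor"}  # keys from the illustrated example


class ScoreSheet:
    def __init__(self, error_ids):
        self.uncategorized = set(error_ids)  # errors not yet considered
        self.assigned = {}                   # error id -> "major" / "minor"
        self.counts = Counter()              # backs the on-screen counters

    def assign(self, error_id, key):
        """Scorer selects an error and presses a key to set its type."""
        error_type = KEY_TO_TYPE[key.upper()]
        previous = self.assigned.get(error_id)
        if previous is not None:
            self.counts[previous] -= 1       # re-categorizing moves the tally
        self.assigned[error_id] = error_type
        self.uncategorized.discard(error_id)
        self.counts[error_type] += 1

    def unflag(self, error_id):
        """Scorer marks a flagged difference as a non-error."""
        previous = self.assigned.pop(error_id, None)
        if previous is not None:
            self.counts[previous] -= 1
        self.uncategorized.discard(error_id)
        self.counts["non_error"] += 1
```

The `uncategorized` set corresponds to the errors that would still be shown in the "not yet categorized" color, so an interface built on this sketch could restyle an error as soon as it leaves that set.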

In at least some cases a scorer can highlight words or phrases in a text caption causing a processor to indicate durations of silence prior to the selected word or each word in a selected phrase. To this end, see, for instance, the highlighted phrase “may go out and catch a movie” in FIG. 51 where pre-word delays are shown before each word in the highlighted phrase including, for instance, delays 1605 and 1607 corresponding to the words “may” and “go”, respectively. Here, a scorer can use the delays to develop a sense of whether or not words repeated in CA corrected text are meaningful. For instance, where a CA corrected transcript includes the phrase “no no”, whether or not this word duplication is meaningful may be dependent upon the delay between the two words. For instance, where there is no delay between the words, the duplication was not necessary as one “no” would have gotten the meaning across. On the other hand, where there is a several second delay between the first and second “no” utterances in the HU voice signal, that indicates that each word was a separate answer (e.g., the end of one sentence and the beginning of another). A scorer can use this type of information as another metric for scoring CA performance.
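Pre-word delays of the kind shown in FIG. 51 can be derived from per-word timing. The sketch below assumes each word carries start and end times in seconds from the HU voice signal; the function names and the 2 second gap used to call a repeated word "meaningful" are hypothetical.

```python
def pre_word_delays(words):
    """words: list of (text, start_s, end_s) tuples in spoken order.

    Returns (text, silence_before_s) pairs; the first word gets 0.0.
    """
    delays = []
    prev_end = None
    for text, start_s, end_s in words:
        gap = 0.0 if prev_end is None else max(0.0, start_s - prev_end)
        delays.append((text, gap))
        prev_end = end_s
    return delays


def meaningful_repeats(words, min_gap_s=2.0):
    """Flag immediately repeated words and whether the silence between them
    is long enough (assumed threshold) to suggest separate utterances."""
    delays = pre_word_delays(words)
    flagged = []
    for i in range(1, len(delays)):
        if delays[i][0].lower() == delays[i - 1][0].lower():
            flagged.append((i, delays[i][1] >= min_gap_s))
    return flagged
```

For the “no no” example, two back-to-back utterances yield a flag of `False` (the duplicate adds nothing), while a several second gap yields `True` (each word is likely a separate answer).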

One other way to monitor CA attention is to insert random or periodic indicators into the ASR engine text that the CA has to recognize within the text in some fashion to confirm the CA's attention. For instance, referring again to FIG. 36, in some cases a separate check box may be presented for each ASR transcript line of text as shown at 1610 where a CA has to select each box to place an “X” therein to indicate that the line has been examined. In other cases check boxes may be interspersed throughout the transcript text presented to the CA and the CA may need to select each of those boxes to confirm her attention.

Other AU Device Features and Processes

In at least some of the embodiments described above an AU has the option to request CA assistance or more CA assistance than currently afforded on a call and/or to request ASR engine text as opposed to CA generated text (e.g., typically for privacy purposes). While a request to change caption technique may be received from an AU, in at least some cases the alternative may not be suitable for some reason and, in those cases, the system may forego a switch to a requested technique and provide an indication to a requesting AU that the switch request has been rejected. For instance, if an AU receiving CA generated and corrected text requests a switch to an ASR engine but accuracy of the ASR engine is below some minimal threshold, the system may present a message to the AU that the ASR engine cannot currently support captioning and the CA generation and correction may persist. In this example, once the ASR engine is ready to accurately generate text, the switch thereto may be either automatic or the system may present a query to the AU seeking authorization to switch over to the ASR engine for subsequent captioning.

In a similar fashion, if an AU requests additional CA assistance, a system processor may determine that ASR engine text accuracy is low for some reason that will also affect CA assistance and may notify the AU that a switch will not be made along with a reason (e.g., “Communication line fault”).

In cases where privacy is particularly important to an AU on a specific call or generally, the caption system may automatically, upon request from an AU or per AU preferences stored in a database, initiate all captioning using an ASR engine. Here, where corrections are required, the system may present short portions of an HU's voice signal to a series of CAs so that each CA only considers a portion of the text for correction. Then, the system would stitch all of the CA corrected text together into an HU text stream to be transmitted to the AU device for display.

In some cases it is contemplated that an AU device interface may present a split text screen to an AU so that the AU has the option to view essentially real time ASR generated text or CA corrected text when the corrected text substantially lags the ASR text. To this end, see the exemplary split screen interface 1450 in FIG. 45 where CA corrected text is shown in an upper field 1452 and “real time” ASR engine text is presented in a lower field 1454. As shown, a “CA location” tag 1456 is presented at the end of the CA corrected text while a “Broadcast” tag 1458 is presented at the end of the ASR engine text to indicate the CA and broadcast locations within the text string. Where CA correction latency reaches a threshold level (e.g., the text between the CA correction location and the most recent ASR text no longer fits on the display screen), text in the middle of the string may be replaced by a period indicator to indicate the duration of HU voice signal at the speaking speed that corresponds to the replaced text. Here, as the CA moves on through the text string, text in the upper field 1452 scrolls up and as the HU continues to speak, the ASR text in the bottom field 1454 also scrolls up independent of the upper field scrolling rate.
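The replacement of middle text by a period indicator can be sketched as follows. The function name `elide_middle`, the word-count notion of display capacity, the even head/tail split and the `"[N s]"` indicator format are all assumptions; the disclosure only states that the indicator reflects the duration of the replaced text at the speaking speed.

```python
def elide_middle(words, capacity, wpm):
    """Fit the words between the CA location and the broadcast location into
    `capacity` display slots (capacity >= 1) by replacing the middle of the
    string with a duration indicator computed from the speaking rate."""
    if len(words) <= capacity:
        return list(words)
    keep = capacity - 1            # one slot is reserved for the indicator
    head = keep // 2
    tail = keep - head
    removed = len(words) - keep
    seconds = round(removed / wpm * 60)
    return words[:head] + [f"[{seconds} s]"] + words[len(words) - tail:]
```

For instance, 20 words at 120 words per minute squeezed into 7 slots keep the first three and last three words and show a 7 second indicator for the 14 words replaced.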

In at least some cases it is contemplated that an HU may use a communication device that can provide video of the HU to an AU during a call. For instance, an HU device may include a portable tablet type computing device or smart phone (see 1219 in FIG. 33) that includes an integrated camera for telepresence type communication. In other cases, as shown in FIG. 33, a camera 1123 may be linked to the HU phone or other communication device 14 for collecting HU video when activated. Where HU video is obtained by an HU device, in most cases the video and voice signals will already be associated for synchronous playback. Here, where the HU voice and video signals are transmitted to an AU device, the HU video may be broken down into video segments that correspond with time stamped text and voice segments and the stamped text, voice and video segments may be stored for simultaneous replay to the AU as well as to a CA if desired. Here, where there are delays between broadcast of consecutive HU voice segments as text transcription progresses, in at least some cases the HU video will freeze during each delay. In other cases the video and audio voice signal may always be synchronized even when text is delayed. If the HU voice signal is sped up during a catch up period as described above, the HU video may be shown at a faster speed so that the voice and video broadcasts are temporally aligned.

FIG. 42 shows an exemplary AU device screen shot 1308 including transcribed text 1382 and a video window or field 1384. Here, assuming that all of the shown text at 1382 has already been broadcast to the AU, if the AU selects the phrase “you should bing the cods along” as indicated by hand icon 1386, the AU device would identify the voice segment and video segment associated with the selected text segment and replay both the voice and video segments while the phrase remains highlighted for the user to consider.

Referring yet again to FIG. 33, in some cases the AU device or AU station may also include a video camera 1125 for collecting AU video that can be presented to the HU during a call. Here, it is contemplated that at least some HUs may be reticent to allow an AU to view HU video without having the reciprocal ability to view the AU during an ongoing call and therefore reciprocal AU viewing would be desirable.

At least four advantages result from systems that present HU video to an AU during an ongoing call. First, where the video quality is relatively high, the AU will be able to see the HU's facial expressions which can increase the richness of the communication experience.

Second, in some cases the HU representation in a video may be useable to discern words intended by an HU even if a final text representation thereof is inaccurate. For instance, where a text transcription error occurs, an AU may be able to select the phrase including the error and view the HU video associated with the selected phrase while listening to the associated voice segment and, based on both the audio and video representations, discern the actual phrase spoken by the HU.

Third, it has been recognized that during most conversations, people instinctively provide visual cues to each other that help participants understand when to speak and when to remain silent while others are speaking. In effect, the visual cues operate to help people take turns during a conversation. By providing video representations to each of an HU and an AU during a call, both participants can have a good sense of when their turn is to talk, when the other participant is struggling with something that was said, etc. Thus, for instance, in many cases an HU will be able to look at the video to determine if an AU is silently waiting to view delayed text and therefore will not have to ask if there is a delay in AU communication.

Fourth, for deaf AUs that are trained to read lips, the HU video may be useable by the AU to enhance communication.

In at least some cases an AU device may be programmed to query an HU device at the beginning of a communication to determine if the HU device has a video camera useable to generate an HU video signal. If the HU device has a camera, the AU device may cause the HU device to issue a query to the HU requesting access to and use of the HU device camera during the call. For instance, the query may include brief instructions and a touch selectable “Turn on camera” icon or the like for turning on the HU device camera. If the HU rejects the camera query, the system may operate without generating and presenting an HU video as described above. If the HU accepts the request, the HU device camera is turned on to obtain an HU video signal while the HU voice signal is obtained and the video and voice signal are transmitted to the AU device for further processing.

There are video relay systems on the market today where specially trained CAs provide a sign language service for deaf AUs. In these systems, while an HU and an AU are communicating via a communication link or network, an HU voice signal is provided to a CA. The CA listens to the HU voice signal and uses her hands to generate a sequence of signs that correspond at least roughly to the content (e.g., meaning) of the HU voice messages. A video camera at a CA station captures the CA sign sequence (e.g., “the sign signal”) and transmits that signal to an AU device which presents the sign signal to the AU via a display screen. If the AU can speak, the AU talks into a microphone and the AU's voice is transmitted to the HU device where it is broadcast for the HU to hear.

In at least some cases it is contemplated that a second or even a third communication signal may be generated for the HU voice signal that can be transmitted to the AU device and presented along with the sign signal to provide additional benefit to the AU. For instance, it has been recognized that while sign language can come close to the meaning expressed in an HU voice signal, in many cases there is no exact translation of a voice message to a sign sequence and therefore some meaning can get lost in the voice to sign signal translation. In these cases, it would be advantageous to present both a text translation and a sign translation to an AU.

In at least some cases it is contemplated that an ASR engine at a relay or operated by a fourth party server linked to a relay may, in parallel with a CA generating a sign signal, generate a text sequence for an HU voice signal. The ASR text signal may be transmitted to an AU device along with or in parallel with the sign signal and may be presented simultaneously as the text and sign signals are generated. In this way, if an AU questions the meaning of a sign signal, the AU can refer to the ASR generated text to confirm meaning or, in many cases, review an actual transcript of the HU voice signal as opposed to a sometimes less accurate sign language representation.

In many cases an ASR will be able to generate text far faster than a CA will be able to generate a sign signal and therefore, in at least some cases, ASR engine text may be presented to an AU well before a CA generated sign signal. In some cases where an AU views, reads and understands text segments well prior to generation and presentation of a sign signal related thereto, the AU may opt to skip ahead and forego sign language for the intervening HU voice signal. Where an AU skips ahead in this fashion, the CA would be skipped ahead within the HU voice signal as well and would continue signing from the skipped-to point on.

In at least some cases it is contemplated that a relay or other system processor may be programmed to compare text signal and sign signal content (e.g., actual meaning ascribed to the signals) so that time stamps can be applied to text and sign segment pairings, thus enabling an AU to skip back through communications to review a sign signal simultaneously with a paired text tag or other indicator. For instance, in at least some embodiments as HU voice is converted by a CA to sign segments, a processor may be programmed to assess the content (e.g., meaning) of each sign segment. Similarly, the processor may also be programmed to analyze the ASR generated text for content and to then compare the sign segment content to the text segment content to identify matching content. Where sign and text segment content match, the processor may assign a time stamp to the content matching segments and store the stamp and segment pair for subsequent access. Here, if an AU selects a text segment from her AU device display, instead of (or in addition to in some embodiments) presenting an associated HU voice segment, the AU device may present the sign segment paired with the selected text.

Referring again to FIG. 33, the exemplary CA station includes, among other components, a video camera 55 for taking video of a signing CA to be delivered along with transcribed text to an AU. Referring also and again to FIG. 42, a CA signing video window is shown at 1390 alongside a text field that includes text corresponding to an HU voice signal. In FIG. 42, if an AU selects the phrase labelled 1386, that phrase would be visually highlighted or distinguished in some fashion and the associated or paired sign signal segment would be represented in window 1390.

In at least some video relay systems, in addition to presenting sign and text representations of an HU voice signal, an HU video signal may also be used to represent the HU during a call. In this regard, see again FIG. 42 where both an HU video window 1384 and a CA signing window 1390 are presented simultaneously. Here, all communication representations 1382, 1384 and 1390 may always be synchronized via time stamps in some cases while in other cases the representations may not be completely synchronized. For instance, in some cases the HU video window 1384 may always present a real time representation of the HU while text and sign signals 1382 and 1390 are synchronized and typically delayed at least somewhat to compensate for time required to generate the sign signal as well as AU replay of prior sign signal segments.

In still other embodiments it is contemplated that a relay or other system processor may be programmed to analyze sign signal segments generated by a signing CA to automatically generate text segments that correspond thereto. Here the text is generated from the sign signal as opposed to directly from the voice signal and therefore would match the sign signal content more closely in at least some embodiments. Because the text is generated directly from the sign signal, time stamps applied to the sign signal can easily be aligned with the text signal and there would be no need for content analysis to align signals. Instead of using content to align, a sign signal segment would be identified and a time stamp applied thereto, then the sign signal segment would be translated to text and the resulting text would be stored in the system database correlated to the corresponding sign signal segment and the time stamp for subsequent access.

FIG. 44 shows yet another exemplary AU screen shot 1400 where text segments are shown at 1402 and an HU video window is shown at 1412. The text 1402 includes a block of text where the block is presented in three visually distinguished ways. First, a currently audibly broadcast word is highlighted or visually distinguished in a first way as indicated at 1406. Second, the line of text that includes the word currently being broadcast is visually distinguished in a second way as shown at 1404. Other text lines are presented above and below the line 1404 to show preceding text and following text for context. In addition, the line at 1404 including the currently broadcast word at 1406 is presented in a larger format to call an AU's attention to that line of text and the word being broadcast. The larger text makes it easier for an AU to see the presented text. Moreover, the text block 1402 is controlled to scroll upward while keeping the text line that includes the currently broadcast word generally centrally vertically located on the AU device display so that the AU can simply train her eyes at the central portion of the display with the transcribed words scrolling through the field 1404. In this case, a properly trained AU would know that prior broadcast words can be rebroadcast by tapping a word above field 1404 and that the broadcast can be skipped ahead by tapping one of the words below field 1404. Video window 1412 is provided spatially close to field 1404 so that the text presented therein is intuitively associated with the HU video in window 1412.

In at least some embodiments it is contemplated that when a CA replaces an ASR engine to generate text for some reason where the CA revoices an HU voice signal to the ASR engine to generate the text, instead of providing the voice signal re-voiced by the CA to an ASR engine at the relay, the CA revoicing signal may be routed to the ASR engine that was previously being used to convert the HU voice signal to text. Thus, for instance, where a system was transmitting an HU voice signal to a fourth party ASR engine provider when a CA takes over text generation via re-voicing, when the CA voices a word, the CA voice signal may be transmitted to the fourth party provider to generate transcribed text which is then transmitted back to the relay and on to the AU device for presentation.

In at least some cases it is contemplated that a system processor may treat at least some CA inputs into the system differently as a function of how well the ASR is likely performing. For instance, as described above, in at least some cases when a CA selects a word in a text transcript on her display screen for error correction, in normal operation, the selected word is highlighted for error correction. Here, however, in some cases what happens when a CA selects a text transcript word may be tied to the level of perceived or likely errors in the phrase that includes the selected word. Where a processor determines that the number of likely errors in the phrase is small, the system may operate in the normal fashion so that only the selected word or sub-phrase (e.g., after word selection and a swiping action) is highlighted and prepared for replacement or correction. Where the processor determines that the number of likely errors in the phrase is large (e.g., the phrase is predictably error prone), the system may operate to highlight the entire error prone phrase for error correction so that the CA does not have to perform other gestures to select the entire phrase. Here, when an entire phrase is visually distinguished to indicate ability to correct, the CA microphone may be automatically unmuted so the CA can revoice the HU voice signal to rapidly generate corrected text.
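The selection behavior described above can be sketched as follows. This is only an illustrative sketch: the function name `selection_span`, the `error_threshold` parameter and its default value of 3 are assumptions made for the example and are not specified in this disclosure.

```python
def selection_span(phrase_words, selected_index, likely_error_count,
                   error_threshold=3):
    """Decide which words to highlight when a CA selects one word.

    Returns ((start, end), unmute_mic) where start and end are inclusive
    word indices within the phrase. The threshold is a hypothetical
    tuning value, not one taken from the specification.
    """
    if likely_error_count >= error_threshold:
        # Error prone phrase: highlight the whole phrase and unmute the
        # CA microphone so the phrase can be revoiced.
        return (0, len(phrase_words) - 1), True
    # Normal operation: highlight only the selected word.
    return (selected_index, selected_index), False
```

For example, selecting the second word of a three word phrase with no likely errors highlights only that word, while the same selection in a phrase with five likely errors highlights the full phrase and unmutes the microphone.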

In other cases, while a simple CA word selection may cause that word to be highlighted, some other more complex gesture after word selection may cause the phrase including the word to be highlighted for editing. For instance, a second tap on a word that immediately follows the word selection may cause a processor to highlight the entire phrase containing the word for editing. Other gestures for phrase, sentence, paragraph, etc., selection are contemplated.

In at least some embodiments it is contemplated that a system processor may be programmed to adjust various CA station operating parameters as a function of a CA's stored profile as well as real time scoring of CA captioning. For instance, CA scoring may lead to a CA profile that indicates a preferred or optimal rate of HU voice signal broadcast (e.g., in words per minute) for a specific CA. Here, the system may automatically use the optimal broadcast rate for the specific CA. As another instance, a processor may monitor the rate of CA captioning, CA correcting and CA error rates and may adjust the rate of HU voice signal broadcast to a rate that results in optimal time and error rate statistics. Here, the rate may be increased during a beginning portion of a CA's captioning shift until optimal statistics result. Here, if statistics fall off at any time, the system may slow the HU voice signal broadcast rate to maintain errors within an acceptable range.
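One simple way to realize the rate adjustment described above is a step controller that raises the broadcast rate while errors stay acceptable and lowers it when they do not. The sketch below is hedged accordingly: the function name, the 5% error ceiling, the 5 words per minute step and the 80 to 200 words per minute bounds are all hypothetical values chosen for illustration.

```python
def next_broadcast_rate(current_wpm, error_rate, max_error_rate=0.05,
                        step_wpm=5, min_wpm=80, max_wpm=200):
    """Return the next HU voice signal broadcast rate in words per minute.

    All numeric defaults are assumptions for this sketch. The rate is
    stepped down when the CA error rate exceeds the acceptable ceiling
    and stepped up otherwise, within fixed bounds.
    """
    if error_rate > max_error_rate:
        # Statistics have fallen off: slow the broadcast.
        return max(min_wpm, current_wpm - step_wpm)
    # Errors are within the acceptable range: try a slightly faster rate.
    return min(max_wpm, current_wpm + step_wpm)
```

A per-CA profile could store the rate at which this loop settles so that a later shift starts from that rate rather than from a default.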

In some cases a CA profile may specify separate optimal system settings for each of several different HU voice signal types or signal characteristics subsets. For instance, for a first CA, a first HU voice signal broadcast rate may be used for a Hispanic HU voice signal while a second relatively slower HU voice signal broadcast rate may be used for a Caucasian HU voice signal. Many other HU voice signal characteristic subsets and associated optimal station operating characteristics are contemplated.

ASR-CA Backed Up Mode

While several different types of semi-automated systems have been described above, one particularly advantageous system includes an automatic speech recognition system that at least initially handles incoming HU voice signal captioning where the ASR generated text is corrected by a CA and where the CA has the ability to manually (e.g., via selection of a button or the like) take over captioning whenever deemed necessary. Hereinafter, unless indicated otherwise, this type of ASR text first and CA correction second system will be referred to as an ASR-CA backed up mode. Advantages of an ASR-CA backed up mode include the following. First, initial caption delay is minimized and remains relatively consistent so that captions can be presented to an AU as quickly as possible. To this end, ASR engines generate initial captions relatively quickly when compared to CA generated text in most cases in steady state.

Second, caption errors associated with current ASR engines can be essentially eliminated by a CA that only corrects ASR errors in most cases and final corrected text can be presented to an AU rapidly.

Third, by combining rapid ASR text with the error correction skills of a CA, it is possible to mix those capabilities in different ways to provide optimal captioning speed and accuracy regardless of characteristics of different calls that are fielded by the captioning system.

Fourth, the combination of rapid ASR text and CA error correction enables a system where an AU can customize their captioning system in many different ways to suit their own needs and system expectations to enhance their communication capabilities.

While various aspects of an ASR-CA backed up mode have been described above, some of those aspects are described in greater detail and additional aspects are described hereafter.

While an ASR engine is typically much faster at generating initial caption text than a CA, in at least some specific cases a CA may in fact be faster than an ASR engine. Whether or not CA captioning is likely to be faster than ASR captioning is often a function of several factors including, for instance, a CA's particular captioning strengths and weaknesses as well as characteristics of an HU voice signal that is to be captioned. For instance, a specific first CA may typically rapidly caption Hispanic voice signals but may only caption Midwestern voice signals relatively slowly so that when captioning a Hispanic signal the CA speed can exceed the ASR speed while the CA typically cannot exceed the ASR speed when captioning a Midwestern voice signal. As another instance, while an ASR may caption high quality HU voice signal faster than the first CA, the first CA may caption low quality HU voice signal faster than the ASR.

As described above, in some cases the system may present an option (see caption source switch button 751 in FIG. 23A) for a CA to change from the ASR generating original text and the CA correcting that text to a system where the CA generates original text and corrects errors. In other cases a system processor may automatically change the system over to CA original and corrected text when the ASR is too slow, is generating too many meaningful (e.g., “visible”, changing the meaning of a phrase) transcription errors, or both. In still other cases a system processor that determines that a specific CA, based on CA strengths and HU voice signal characteristics, would likely be able to generate initial text faster than the ASR, may be programmed to offer a suggestion to the CA to switch over.

Thus, in some cases the caption source switch button 751 in FIG. 23A may only be presented to a CA as an option when a system processor determines that the specific CA should be able to generate faster initial captions for an HU voice signal. In an alternative, button 751 may always be presented to a CA but may have two different appearances including the full button for selection and a greyed out appearance to indicate that the button is not selectable. Here, by presenting the greyed out button when not selectable, a user will not be confused by the button being absent.

In some cases a processor may have to determine that a CA is likely to speed up transcription appreciably prior to presenting button 751 so that small possible increases in speed do not cause a suggestion to be presented to the CA which could simply distract the CA from error correction. For instance, in an exemplary case, a processor may have to calculate that it is likely a specific CA can speed up transcription by 15% or more in order to present button 751 to the CA for selection.
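The threshold test described above can be sketched as a single comparison. The sketch assumes, purely for illustration, that speed is measured as caption delay in seconds and that a "speed up" is the fractional reduction in that delay; the function and parameter names are hypothetical, and only the 15% figure comes from the exemplary case above.

```python
def should_present_switch_button(asr_delay_s, predicted_ca_delay_s,
                                 min_speedup=0.15):
    """Return True when a CA is predicted to be appreciably faster.

    Speed up is modeled (as an assumption of this sketch) as the
    fractional reduction in caption delay relative to the ASR delay.
    """
    if asr_delay_s <= 0:
        # No measurable ASR delay to improve on.
        return False
    speedup = (asr_delay_s - predicted_ca_delay_s) / asr_delay_s
    return speedup >= min_speedup
```

With a 4.0 second ASR delay, a predicted CA delay of 3.0 seconds (a 25% reduction) would cause the button to be presented, while a predicted CA delay of 3.8 seconds (a 5% reduction) would not.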

In some cases the system processor may take into account more than initial captioning speed when determining when to present caption source switch button 751 to a CA. For instance, in some cases the processor may account for some combination of speed and some factor related to the number of transcription errors generated by an ASR to determine when to present button 751. Here, how speed and accuracy factors are weighed to determine when button 751 should be presented to a user may be a matter of designer choice and should be set to create a best possible AU experience.

In at least some cases it is contemplated that when the system automatically switches to full CA captioning and correction or the CA selects button 751 to switch to full CA captioning and correction, the ASR may still operate in parallel with the CA to generate a second initial version (e.g., second to the CA generated captions) of the HU voice signal and the system may transmit whichever captions are generated first (e.g., ASR or CA) to the AU device for presentation. Here, it has been recognized that even when a CA takes over full captioning and correction, which captioning is fastest, ASR or CA, may switch back and forth and, in that case, the fastest captions should always be provided to the AU.

As recognized above, in at least some cases third party (e.g., a server in the cloud) ASR engines have at least a couple of shortcomings. First, third party ASR engine accuracy tends to decrease at the end of relatively long voice signal segments to be transcribed.

Second, ASR engines use context to generate final transcription results and therefore are less accurate when input voice segments are short. To this end, initial ASR results for a word in a voice signal are typically based on phonetics and then, once initial results for several consecutive words in a signal are available, the ASR engine uses the context of the words together as well as additional characteristics of the voice of the speaker generating the voice signal to identify a best final transcription result for each word. Where a voice segment in an ASR request is short, the signal includes less context in the segment for accurately identifying a final result and therefore the results tend to be less accurate.

Third, final results tend to be generated in clumps which means that automated ASR error corrections presented to a CA or an AU tend to be presented in spurts which can be distracting. For instance, if five consecutive words are changed in text presented on an AU's device display at the same time, the changes can be distracting.

As described above, one solution to the third party ASR shortcomings is to divide an HU voice signal into signal slices that overlap to avoid inaccuracies related to long duration signal segments. In addition, to make sure that all final transcription results are contextually informed, each segment slice should be at least some minimum segment length to ensure sufficient context. Ideally, segment slices sent to the ASR engine as transcription requests would include a predefined number of words within a range (e.g., 3 to 15 words) where the range is selected to ensure at least some level of context to inform the final result. Unfortunately, an HU voice signal is not transcribed prior to sending it to the ASR engine and therefore there is no way to ascertain the number of words in a voice segment prior to receiving transcription results back from the ASR.

For this reason segment slices have to be time based as opposed to word count based where the time range of each segment is selected so that it is likely the segment includes an optimal number (e.g., 3 to 15 words) of words spoken by an HU. In at least some cases the time range will be between 1 and 10 seconds and, in particularly advantageous cases, the range is between 1 and 3 seconds.

Once initial and/or final transcription results are received back at a relay for one or more HU voice signal segments, a relay processor may count the number of words in the transcription and automatically adjust the duration of each HU voice signal segment up or down to adapt to the HU's rate of speech so that each subsequent segment slice has the greatest chance of including an optimal number of words. Thus, for instance, where an HU talks extremely quickly, an initial segment slice duration of four seconds may be shortened to a two second duration.
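The duration adaptation described above can be sketched by estimating the HU's speaking rate from the last returned transcription and sizing the next slice to hit a target word count. The target of 9 words is an assumption of this sketch (a value inside the 3 to 15 word range discussed above), as are the function and parameter names; the 1 to 10 second clamp follows the time range stated above.

```python
def adjust_slice_duration(duration_s, word_count, target_words=9,
                          min_s=1.0, max_s=10.0):
    """Return the duration in seconds to use for the next slice.

    duration_s and word_count describe the slice just transcribed.
    target_words is a hypothetical tuning value for this sketch.
    """
    if word_count <= 0:
        # No words returned (e.g., silence): keep the current duration.
        return duration_s
    words_per_second = word_count / duration_s
    ideal_s = target_words / words_per_second
    # Clamp to the 1 to 10 second range.
    return max(min_s, min(max_s, ideal_s))
```

Under these assumptions a four second slice that returned 18 words (4.5 words per second) leads to a two second duration for the next slice, which mirrors the fast-talker example above.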

In at least some cases a relay may only use central portions of ASR transcribed HU voice signal slices for final transcription results to ensure that all final transcribed words are contextually informed. Thus, for instance, where a typical voice signal slice includes 12 words, the relay processor may only use the third through ninth words in an associated transcription to correct the initial transcription so that all of the words used in the final results are context informed.
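A minimal sketch of the central portion selection follows. The margins of two leading and three trailing words are inferred from the 12 word example above (third through ninth words); the function name and the fallback for very short slices are assumptions of the sketch.

```python
def central_words(words, skip_head=2, skip_tail=3):
    """Return the contextually informed middle of a transcribed slice.

    With the default margins, a 12 word slice yields its third through
    ninth words. Slices too short to trim are returned whole, which is
    an assumption of this sketch rather than a stated behavior.
    """
    if len(words) <= skip_head + skip_tail:
        return list(words)
    return list(words[skip_head:len(words) - skip_tail])
```

For a slice transcribed as `["w1", "w2", ..., "w12"]`, the function returns `["w3", ..., "w9"]`.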

As indicated above, consecutive HU voice segment slices sent to ASR engines may be overlapped to ensure no word is missed. Overlapping segments also has the advantage that more context can be presented for each final transcription word. At the extreme the relay may transmit a separate ASR transcription request for each sub-period that is likely to be associated with a word (e.g., based on HU speaking rate or average HU speaking rate) and only one or a small number of transcribed words in a returned text segment may be used as the final transcription result. For instance, where overlapping segments each return an average of seven final transcribed words, the relay may only use the middle three of those words to correct initial text presented to the CA and the AU.

Where ASR transcription requests include overlapping HU voice signal segments, consecutive requests will return duplicative transcriptions of the same words. In at least some cases the relay processor receiving overlapping text transcriptions will identify duplicative word transcriptions and eliminate duplication in initial text presented to the CA and the AU as well as in final results.
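One simple way to eliminate the duplication described above is to find the longest run of words at the end of the already presented text that also begins the newly returned text, and to append only what follows. This is a sketch under the assumption that both transcriptions render the overlapping words identically; where they differ, time stamps or fuzzy matching would be needed instead. The function name is hypothetical.

```python
def merge_overlapping(previous, incoming):
    """Append incoming words to previous words without duplicates.

    Looks for the longest suffix of `previous` that equals a prefix of
    `incoming` and drops that prefix before appending. Assumes exact
    word matches in the overlap (a simplification for this sketch).
    """
    max_overlap = min(len(previous), len(incoming))
    for size in range(max_overlap, 0, -1):
        if previous[-size:] == incoming[:size]:
            return previous + incoming[size:]
    # No overlap found: treat the incoming words as entirely new.
    return previous + incoming
```

For example, merging `["the", "quick", "brown", "fox"]` with `["brown", "fox", "jumps"]` yields `["the", "quick", "brown", "fox", "jumps"]`.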

In at least some cases it is contemplated that overlapping ASR requests may correspond to different length HU voice signal segments where some of the segment lengths are chosen to ensure rapid (e.g., essentially immediate) captions and rapid intermediate correction results while other lengths are chosen to optimize for context informed accuracy in final results. To this end, a first set of ASR requests may include short HU voice signal slices to expedite captioning and intermediate correction speed albeit while sacrificing some accuracy, and a second set of ASR requests may be relatively longer so that context informed final text is optimally identified.

Referring to FIG. 46, a schematic is shown that includes a single HU voice signal line of text where the text is divided into signal segments or slices including first through sixth short slices and first, second and third long slices. The first long slice includes voice signal associated with the first through third short slices. The first long slice includes many words usable for immediate initial transcription as well as for final contextual transcription correction. Each long slice word is transmitted to an ASR engine essentially immediately as the HU voices the segment (e.g., a link to the ASR is opened at the beginning of the long slice and remains open as the HU voices the slice). Initial transcription of each word in the first long slice is almost immediate and is fed immediately to the CA for manual correction and to the AU as an initial text transcription irrespective of transcription errors that may exist. As more first slice words are voiced and transmitted to the ASR engine, those words are immediately transcribed and presented to the CA and AU and are also used to provide context for previously transcribed words in the first long slice so that errors in the prior words can be corrected.

Referring still to FIG. 46, the second long slice overlaps the first long slice and includes a plurality of words that correspond to a second slice duration. To handle the second long slice transcription, a second ASR request is transmitted to an ASR engine as the HU voices each word in the second slice and substantially real time or immediate text is transmitted back from the engine for each received word. In addition, as the second slice words are transcribed, those words are also used by the ASR engine to contextually correct prior transcribed words in the second slice to eliminate any perceived errors and those corrections are used to correct text presented to the CA and the AU.

The third long slice overlaps the second long slice and includes a plurality of words that correspond to a third slice duration. To handle the third long slice transcription, a third ASR request is transmitted to an ASR engine as the HU voices each word in the third slice and substantially real time or immediate text is transmitted back from the engine for each received word. In addition, as the third slice words are transcribed, those words are also used by the ASR engine to contextually correct prior transcribed words in the third slice to eliminate any perceived errors and those corrections are used to correct text presented to the CA and the AU.

It should be apparent from FIG. 46 that, because long slices overlap, two (and in some cases more) transcriptions for many HU voice signal words will be received by a relay from one or more ASR engines and therefore a relay processor has to be programmed to select which of the two or more initial transcriptions for a word to present to a CA and an AU and which of two or more final transcriptions for the word to use to correct text already presented to the CA and AU. In at least some embodiments the relay processor may be programmed to select the first long slice in an HU voice signal for generating initial transcription text for all first long slice words, the second long slice in the voice signal for generating initial transcription text for all second long slice words that follow the end time of the first long slice and the third long slice in the voice signal for generating initial transcription text for all third long slice words that follow the end time of the second long slice.
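The first selection rule described above, in which each long slice supplies initial text only for words that follow the end time of the preceding long slice, can be sketched as below. The sketch assumes that each transcribed word arrives with a time stamp relative to the HU voice signal; the `Word` and `SliceResult` types and the function name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    time_s: float  # assumed per-word time stamp within the HU voice signal


@dataclass
class SliceResult:
    start_s: float
    end_s: float
    words: list  # list of Word, in spoken order


def select_initial_text(slices):
    """Pick one initial transcription per word from overlapping slices.

    The earliest slice supplies all of its words; each later slice
    supplies only words that follow the end time of the prior slice.
    """
    selected = []
    previous_end = float("-inf")
    for result in sorted(slices, key=lambda s: s.start_s):
        selected.extend(w.text for w in result.words
                        if w.time_s > previous_end)
        previous_end = result.end_s
    return selected
```

For two slices covering 0 to 6 seconds and 4 to 10 seconds, a word voiced at 5 seconds appears in both transcriptions but is taken only from the first slice, while words at 7 and 9 seconds are taken from the second.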

In an alternative system, the relay processor may be programmed to select the first long slice in an HU voice signal for generating initial transcription text for all first long slice words prior to the start time of the second long slice, the second long slice in the voice signal for generating initial transcription text for all second long slice words prior to the start time of the third long slice and the third long slice in the voice signal for generating initial transcription text for all third long slice words.

In yet one other alternative system, for words that are included in overlapping signal slices, the relay processor may pass on the first transcription of any word that is received from any ASR engine to the CA and AU devices to be presented irrespective of which slice included the word. Here, a second or other subsequent initial transcription of an already presented word may be completely ignored or may be used to correct the already presented word in some cases.

Referring again to FIG. 46, regarding final ASR text results for error correction, the first long slice transcription includes more contextual content than the second long slice for about the first two thirds of the first slice voice signal, the second long slice transcription includes more contextual content than the first and third long slices for about the central half of the second slice voice signal and so on. Thus, to provide most accurate ASR transcription error correction, the relay processor may be programmed to use final ASR text from sub-portions of each long signal slice for error correction including final ASR text from about the first two thirds of the first long slice, about the central half of the second long slice and about the last two thirds of the third long slice. Here, because the slices are time based as opposed to word based, the exact sub-portion of each overlapping slice used for final text results can only be approximate until the text results are received back from the ASR engines.

Thus, it should be appreciated that different overlapping voice segments or slices may be used to generate initial and final transcriptions of words in at least some embodiments where the segments are selected to optimize for different purposes (e.g., speed or contextual accuracy).

Referring still to FIG. 46, while shown as consecutive and distinct, consecutive short slices may overlap at least somewhat as described above. Each short slice has a relatively short duration (e.g., 1-3 seconds) and is transmitted to an ASR engine as the HU voices the segment (e.g., a link to the ASR is opened at the beginning of the slice and remains open as the HU voices the slice). Here, initial transcription of each word in a short segment is almost immediate and could in some cases be used to provide the initial transcription of words to a CA and an AU in at least some embodiments. The advantage of shorter voice signal slices in ASR transcription requests is that the ASR should be able to generate more rapid final text transcriptions for words in the shorter segments so that error corrections in text presented to the CA and the AU are completed more rapidly. Thus, for instance, while an ASR may not finalize correction of text at the beginning of the first long slice in FIG. 46 until just after that slice ends so that all of the contextual information in that slice is considered, a different ASR handling the first short slice would complete its contextual error correction just after the end time of the first short slice. Here, because short slice final text is generated relatively rapidly and only affects a small text segment, it can be used to reduce the amount of sporadic large magnitude error corrections that can be distracting to a CA or an AU. In other words, short slice final text error correction is more regular and generally of smaller magnitude than long slice final text error correction.

As explained above, one problem with short voice signal slices is that there is not enough content (e.g., additional surrounding words) in a short slice to result in highly accurate final text. Nevertheless, even short slice context results in better accuracy than initial transcription in most cases and can operate as an intermediate text correction agent to be followed up by long slice final text error correction. To this end, referring yet again to FIG. 46, in at least some embodiments the long text segments may be used to generate initial transcribed text presented to a CA and an AU. Intermediate error corrections in the initial text may be generated via contextual processing of the short signal segments and used immediately as an intermediate error correction for the initial text presented to the CA and AU. Final error correction in the intermediately corrected text may be generated via contextual processing of the long signal segments and used to finally error correct the intermediately corrected text for both the CA and the AU.
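The initial, intermediate and final correction sequence described above can be sketched as a small update rule in which a result for a word replaces the displayed word only if it comes from the same or a later correction stage. The sketch assumes that results from different slices have already been aligned to a common word index; the stage names, the `STAGE_RANK` table and the function name are hypothetical.

```python
# Hypothetical ordering of the three correction stages.
STAGE_RANK = {"initial": 0, "intermediate": 1, "final": 2}


def apply_result(display, index, text, stage):
    """Update the displayed word at `index` with a new ASR result.

    `display` maps a word index to a (text, stage) pair. A result is
    applied only when its stage is at least as late as the stage of the
    word currently displayed, so a late-arriving initial result never
    overwrites an intermediate or final correction.
    """
    current = display.get(index)
    if current is None or STAGE_RANK[stage] >= STAGE_RANK[current[1]]:
        display[index] = (text, stage)
    return display
```

A design consequence of this rule is that the order in which short slice and long slice results arrive does not matter; the displayed text always reflects the most context informed result received so far for each word.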

While initial, intermediate and final ASR text may be presented to each of the CA and an AU in some cases, in other embodiments the intermediate text may only be presented to one or the other of the CA and the AU. For instance, where initial text results may be displayed for each of the CA and the AU, intermediate results related to contextual processing of short voice signal slices may be used to in line correct errors in the CA presented text only to minimize distractions on the AU's display screen.

While the signal slicing and initial and final text selection processes have been described above as being performed by a relay processor, in other embodiments where an AU device or even an HU device links to an ASR engine to provide an HU voice signal thereto and receive text therefrom, the AU or HU device would be programmed to slice the voice signal for transmission in a similar fashion and to select initial and final and in some cases intermediate text to be presented to system users in a fashion similar to that described above.

While ASR engines operate well under certain circumstances, they are simply less effective than pure CA transcription systems under other sets of circumstances. For instance, it has been observed that during a first short time just after an AU-HU call commences and a second short time at the end of the call when accurate content is particularly time sensitive as well as often unclear and rushed, full CA modes have a clear advantage over ASR-CA backed up modes. For this reason, in at least some embodiments it is contemplated that one type of system may initially link the HU portion of a call to a full CA mode where a CA transcribes text and corrects that text for at least the beginning portion of the call after which the call is converted to an ASR-CA backed up call where an ASR engine generates initial text and ASR corrections with a CA further correcting the initial and final ASR text. For instance, in some cases the HU voice signal during the first 10-15 seconds of an AU-HU call may be handled by the full CA mode and thereafter the ASR-CA backed up mode may kick in once the ASR has context for subsequent words and phrases to increase overall ASR accuracy.

In some cases only a small subset of highly trained CAs may handle the full CA mode duties and when the ASR-CA backed up mode kicks in, the call may be transferred to a second CA that operates as a correction only CA most of the time. In other cases a single CA may operate in the full CA mode as well as in the ASR-CA backed up mode to maintain captioning service flow.

It has been recognized that for many AUs that have at least partial hearing capabilities, in most cases during an AU-HU call by far the most important caption text is the text associated with the most recently generated HU voice signal. To this end, in many cases an AU that has at least partial hearing relies on her hearing as opposed to caption text to understand HU communications. Then, when an AU periodically misunderstands an HU voiced word or phrase, the AU will turn to displayed captions to clarify the HU communication. Here, most AUs want immediate correct text in real time as opposed to three or six or more seconds later after a CA corrects the text so that the corrections are as simultaneous with a real time HU voice signal broadcast as possible. To be clear, in these cases, correct text corresponding to the most recent 7 or fewer seconds of HU voice signal is far more important most of the time than correct text associated with HU voice signal from 20 seconds ago.

In these cases and others where accurate substantially real time text is particularly important, a captioning system processor may be programmed to enforce a maximum cumulative duration of HU voice signal broadcast pause seconds to ensure that all CA correction efforts are at least somewhat aligned with the HU's real time voice signal. For instance, in some cases the maximum cumulative pause may be limited to seven seconds or five seconds or even three seconds to ensure that essentially real time corrections to AU captions occur. In other cases the maximum cumulative delay may be limited by a maximum number of ASR text words so that, for instance, a CA cannot get more than 3 or 5 or 7 words behind the initially generated ASR text.
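The two limits described above (a cap on cumulative pause seconds and a cap on how many words a CA may trail the ASR text) can be sketched as follows. This is only an illustrative sketch: the function names, the default values and the idea of advancing the CA's position to satisfy the word cap are assumptions for illustration and are not taken from the specification.

```python
def remaining_pause_budget(used_pause_s: float, max_pause_s: float = 7.0) -> float:
    """Seconds of broadcast pause the CA may still accumulate (never negative)."""
    return max(0.0, max_pause_s - used_pause_s)


def enforce_word_lag(ca_word_index: int, asr_word_index: int,
                     max_words_behind: int = 5) -> int:
    """Return the word index the CA broadcast should be at after the cap is applied.

    If the CA trails the most recent ASR word by more than max_words_behind,
    the CA position is advanced so the gap equals the cap (an assumed policy).
    """
    earliest_allowed = asr_word_index - max_words_behind
    return max(ca_word_index, earliest_allowed)


if __name__ == "__main__":
    print(remaining_pause_budget(5.5))   # 1.5 seconds of pause left
    print(enforce_word_lag(10, 20, 5))   # CA is advanced to word 15
```

A seconds-based cap and a word-based cap could be used separately or together; the sketch keeps them independent so either can be applied alone.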

Referring now to FIG. 52, an exemplary CA display screen shot 1650 is illustrated that presents ASR text to a CA as the CA listens to a hearing user's voice signal via headset 54 as indicated at 1654. In this case, the CA is restricted to editing only text that appears in the most recent two lines 1662 and 1664 of the presented text, which are visually distinguished by an offsetting box labelled 1656. Box 1656 stays stationary as additional ASR generated text is generated and added to the bottom of the text block 1652 and the on screen text scrolls upward. Again, as in several other figures described above, a system processor highlights or otherwise visually distinguishes the text word that corresponds to the instantaneously broadcast HU voice signal word as shown at 1660. Here, however, when the text 1652 scrolls up one line, if the word being broadcast is in the top line 1662 in box 1656 when scrolling occurs, the broadcast to the CA skips to the first word in the next line 1664 when a new line of text is added there below. To this end see FIG. 53 where one line of scrolling occurred while the system was still broadcasting a word in line 1662 in FIG. 52 so that the highlighted and broadcast word is skipped ahead to the word “want” at the beginning of line 1664.

In some cases a limitation on CA corrections may be based on the maximum amount of text that can be presented on the CA display screen. For instance, in a case where only approximately 100 ASR generated words can appear on an AU's display screen, it would make little sense to allow a CA to correct errors in ASR text prior to the most recent 100 words because it is highly likely that earlier corrections would not be visible to the AU. Thus, for instance, in some cases a cumulative maximum seconds delay may be set to 20 seconds where text associated with times prior to the 20 second threshold simply cannot be corrected by the CA. In other cases the cumulative maximum delay may be word count based (e.g., the maximum delay may be no more than 30 ASR generated words). In other cases the maximum delay may vary with other sensed parameters such as line signal quality, the HU's speaking rate (e.g., words per minute actual or average), a CA's current or average captioning statistics, etc.

A CA's ability to correct text errors may be limited in several different ways. For instance, relatively aged text that a CA can no longer correct may be visually distinguished (e.g., highlighted, scrolled up into a “firm” field, etc.) in a fashion different from text that the CA can still correct. As another instance, text that cannot be corrected may simply be scrolled off or otherwise removed from the CA display screen.

Where a CA is limited to a maximum number of cumulative delay seconds, the cumulative delay count may be reduced by any perceived HU silent periods that occur between a current time and a time that precedes the current time by the instantaneous delay count. Thus, for instance, if a current delay second count is 18 seconds, if the most recent 18 seconds includes a 12 second HU silent period (e.g., during an AU talking turn), then the cumulative delay may be adjusted downward to 6 seconds as the system will be able to remove the 12 second silent period from CA consideration so that the CA can catch up more rapidly.
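The silent period adjustment just described amounts to subtracting, from the current delay count, whatever silent time falls inside the delay window. A minimal sketch follows, assuming silent periods are supplied as non-overlapping (start, end) times in seconds; the function name and data layout are illustrative assumptions only.

```python
from typing import Iterable, Tuple


def adjusted_delay(delay_s: float,
                   silent_periods: Iterable[Tuple[float, float]],
                   now_s: float) -> float:
    """Reduce a cumulative delay by HU silence inside the window [now - delay, now].

    silent_periods is assumed to hold non-overlapping (start_s, end_s) pairs.
    """
    window_start = now_s - delay_s
    silent_total = 0.0
    for start_s, end_s in silent_periods:
        # Only the part of the silent period inside the delay window counts.
        overlap = min(end_s, now_s) - max(start_s, window_start)
        if overlap > 0:
            silent_total += overlap
    return max(0.0, delay_s - silent_total)


if __name__ == "__main__":
    # The example from the text: an 18 second delay containing 12 seconds of silence.
    print(adjusted_delay(18.0, [(85.0, 97.0)], now_s=100.0))  # 6.0
```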

In at least some cases it has been recognized that signal noise can appear on a communication link where the noise has a volume and perhaps other detected characteristics but that cannot be identified by an ASR engine as articulated words. Most of the time in these cases the noise is just that, simply noise. In some cases where line signal can clearly be identified as noise, a period associated with the noise may be automatically eliminated from the HU voice signal broadcast to a CA for consideration so that those noisy periods do not slow down CA captioning of actual HU voice signal words. In other cases where an ASR cannot identify words in a received line signal but cannot rule out the line signal as noise, a relay processor may broadcast that signal to a CA at a high rate (e.g., 2 to 4 times the rate of HU speech) so that the possible noisy period is compressed. In most cases where the line signal is actually noise, the CA can simply listen to the expedited signal, recognize the signal as noise, and ignore the signal. In other cases the CA can transcribe any perceived words or may slow down the signal to a normal HU speech rate to better comprehend any spoken words. Here, once the ASR recognizes a word in the HU voice signal and generates a captioned word again, the pace of HU voice signal broadcast can be slowed to the HU's speech rate.

In cases where a CA switches from an ASR-CA backed up mode to a full CA mode, in at least some embodiments, the non-firm ASR generated text is erased from the CA's display screen to avoid CA confusion. Thus, for instance, referring again to FIG. 23A, if a CA selects the full CA captioning/correction button 751 to initiate a pure CA text transcription and correction process, the CA display screen shot may be switched to the shot illustrated in FIG. 47. As shown in FIG. 47, firm ASR text prior to the current word considered by the CA at 781 or corrected by the CA persists at 783 but ASR generated text thereafter is wiped from the display screen. The label on the caption source switch button 751 is changed to now present the CA the option to switch back to the ASR-CA backed up type system if desired. The seconds behind field is still present to give the CA a sense of how well she is keeping up with the HU voice signal.

When a CA changes from the ASR-CA backed up mode to a full CA mode, in some embodiments there will be no change in what the AU sees on her display screen and no way to discern that the change took place so that there is no issue with visually disrupting the AU during the switchover. In other embodiments there may be some type of clean break so that the AU has a clear understanding that the captioning process has changed. For instance, see FIG. 48 where, after a CA has selected the full CA mode option, a carriage return occurs after the most recently generated ASR generated text 1500 and a line 1502 is presented to delineate initial ASR and CA generated text. After line 1502, CA generated text is presented to the AU as indicated at 1504. Here, all ASR text previously presented to the AU persists regardless of whether or not the text is firm and any initial CA generated text that is inconsistent with ASR generated text is used to correct the ASR generated text via inline correction so that the ASR generated text that is not firm is not completely wiped from the AU's device display screen.

Thus, for instance, in one exemplary system, when a CA takes over initial captioning from an ASR, while ASR generated text that follows the point in an HU voice broadcast most recently listened to or captioned by a CA is removed from the CA's display screen to avoid CA confusion, that same ASR generated text remains on the AU's display screen so that the AU does not recognize that the switch over to CA captioning occurred from the text presented. Then, as the CA re-voices HU voice signal to generate text or otherwise enters data to generate text for the HU voice signal, any discrepancies between the ASR generated text on the AU display screen and the CA generated text are used to perform in line corrections to the text on the AU display. Thus, to the CA, the initial CA generated text is seen as new text while the AU sees the initial text, up to the end of the prior ASR generated text, as in line error corrections.

When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, the CA display screen shot may switch from a shot akin to the FIG. 47 shot back to the FIG. 23A shot where the button 751 caption is again switched back to “Full CA Captioning/Correction”, the firm text and seconds behind indicator persist at 748A and 755 and where ASR generated non-firm text is immediately presented at 769 subsequent to the word 750A currently broadcast 752A to the CA for consideration and correction.

When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, again, in some embodiments there may be no change in what the AU sees on her display screen and no way to discern that the switch to the ASR-CA backed up mode took place so that the AU's visual experience of the captioned text is not disrupted. In other embodiments the AU display screen shot may switch from a shot akin to the FIG. 48 shot to a screen shot akin to the shot shown in FIG. 49 where a carriage return occurs after the most recently generated CA generated text 1520 and a line 1522 is presented to delineate initial CA generated and corrected text from following ASR generated and CA corrected text. After line 1522, ASR generated and CA corrected text is presented to the AU as indicated at 1524. Here, all CA generated text previously presented to the AU persists.

While the CA and AU display screen shots upon caption source switching are described above in the context of CA initiated caption source switching, it should be appreciated that similar types of switching notifications may be presented when an AU initiates the switching action. To this end, see, for instance, that in some cases when the system is operating as a full CA captioning system as in FIG. 48, an “ASR-CA Back Up” button 771 is presented that can be selected to switch back to an ASR-CA backed up mode operation in which case a screen shot similar to the FIG. 49 shot may be presented to the AU where line 1522 delineates the breakpoint between the CA generated initial text above and the ASR generated initial text that follows.

As another instance, see that in some cases when the system is operating in an ASR-CA backed up mode as in FIG. 49, a “Full CA Captioning/Correction” button 773 is presented that can be selected to switch back to full CA captioning and correction system operation in which case a screen shot similar to the FIG. 48 shot may be presented to the AU where line 1502 delineates the breakpoint between the ASR generated initial text above and the CA generated initial text that follows.

In at least some embodiments as the system operates in the ASR-CA backed up mode of operation, as text is presented to a CA to consider the text for correction, the CA may be limited to only correcting errors that occur prior to a current point in the HU voice signal broadcast to the CA. Thus, for instance, referring again to FIG. 23A where a currently broadcast HU voice signal word is “restaurant”, CA corrections may be limited to text prior to the word “restaurant” at 748A so that the CA cannot change any of the words at 769 until after they are broadcast to the CA.

In at least some embodiments when the system is in the ASR-CA backed up mode, a CA mute feature is enabled whenever the CA has not initiated a correction action and automatically disengages when the CA initiates correction. For instance, referring again to FIG. 50, assume a CA is reviewing the ASR generated text to identify text errors as she is listening to the HU voice signal broadcast. Here, if the CA selects the words “Pistol Pals” via touch as indicated at 1560, the selected text is visually distinguished, the HU voice signal broadcast to the CA halts at the word “restaurant”, the CA keyboard becomes active for entering correction text and the muted CA microphone is activated so that the CA has the option to enter corrective text either via the keyboard or via the microphone. In addition, the HU's voice segment including at least the enunciation related to the selected words “Pistol Pals” is immediately rebroadcast to the CA for consideration while viewing the words “Pistol Pals”. Once the CA corrections are completed, the CA microphone is again disabled and the HU voice signal broadcast skips back to the word “restaurant” where the signal broadcast recommences. In some cases selection of the phrase “Pistol Pals” may also open a drop down window with other probable options for that phrase generated by the ASR engine or some other processor function where the CA can quickly select one of those other options if desired.

In some embodiments when a CA starts to correct a word or phrase in an ASR text transcript, once the CA selects the word or phrase for correction, a signal may be sent immediately to an AU device causing the word or phrase to be highlighted or otherwise visually distinguished so that the AU is aware that it is highly likely that the word or phrase is going to be changed shortly. In this way, an AU can recognize that a word or phrase in an ASR text transcription is likely wrong and if she was relying on the text representation to understand what the HU said, she can simply continue to view the highlighted word or phrase until it is modified by the CA or otherwise cleared as accurate.

Under at least some circumstances an ASR engine may lag an HU voice signal by a relatively long and unacceptable duration. In at least some embodiments it is contemplated that when a relay operates in an ASR-CA backed up mode (e.g., where the ASR generates initial text for correction by a CA), a system processor may track ASR text transcription lag time and, under at least certain circumstances, may automatically switch from the ASR-CA backed up mode to a full CA captioning and correction mode either for the remainder of a call or for at least some portion of the call. For instance, when an ASR lag time exceeds some threshold duration (e.g., 1-15 seconds), the processor may automatically switch to the full CA mode for a predetermined duration (e.g., 15 seconds) so that a CA can work to eliminate or at least substantially reduce the lag time after which the system may again automatically revert back to the ASR-CA backed up mode. As another instance, once the system switches to the full CA mode, the system may remain in the full CA mode while the ASR continues to generate ASR engine text in parallel and a system processor may continue to track the ASR lag time and when the lag time drops below the threshold value either for a short duration or for some longer threshold duration of time (e.g., 5 consecutive seconds), the system may again revert back to the ASR-CA backed up operating mode. In still other cases where a system processor determines that some other communication characteristic (e.g., line quality, noise level, etc.) or HU voice signal characteristic (e.g., WPM, slurring of words, etc.) is a likely cause of the poor ASR performance, the system may switch to full CA mode and maintain that mode until the perceived communication or voice signal characteristic is no longer detected.
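The second lag based behavior above (switch to full CA mode when ASR lag exceeds a threshold, then revert only after the lag has stayed below the threshold for some consecutive period) can be sketched as a small state machine. The class name, mode labels and the 10 second and 5 second defaults are assumptions chosen for illustration; the specification does not prescribe a particular implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModeController:
    """Tracks ASR lag and switches between two captioning modes (illustrative)."""
    lag_threshold_s: float = 10.0      # assumed threshold for leaving ASR-CA mode
    recover_hold_s: float = 5.0        # lag must stay low this long before reverting
    mode: str = "ASR_CA"               # "ASR_CA" or "FULL_CA"
    _below_since: Optional[float] = None

    def update(self, now_s: float, asr_lag_s: float) -> str:
        """Feed one lag measurement taken at time now_s; returns the resulting mode."""
        if self.mode == "ASR_CA":
            if asr_lag_s > self.lag_threshold_s:
                self.mode = "FULL_CA"
                self._below_since = None
        elif asr_lag_s < self.lag_threshold_s:
            if self._below_since is None:
                self._below_since = now_s
            if now_s - self._below_since >= self.recover_hold_s:
                self.mode = "ASR_CA"
                self._below_since = None
        else:
            # Lag rose again, so the consecutive low-lag period restarts.
            self._below_since = None
        return self.mode


if __name__ == "__main__":
    controller = ModeController()
    for t, lag in [(0, 3), (1, 12), (2, 4), (6, 4), (7, 4)]:
        print(t, lag, controller.update(t, lag))
```

Requiring a sustained low-lag period before reverting avoids rapid toggling between modes when the lag hovers near the threshold.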

In at least some cases where a third party provides ASR engine services, ASR delay can be identified whenever an HU voice signal is sent to the engine and no text is received back for at least some inordinate threshold of time.

In at least some cases the ASR text transcript lag time that triggers a switch to a full CA operating mode may be a function of specific skills or capabilities of a specific CA that would take over full captioning and corrections if a switch over occurs. Here, for instance, given a persistent ASR delay of a specific magnitude, a first CA may be able to be substantially faster than the ASR while a second CA could not, so that a switch over to the second CA would only be justifiable if the persistent ASR delay was much longer. Here it is contemplated that CA profiles will include speed and accuracy metrics for associated CAs which can be used by the system to assess when to change over to the full CA system and when not to change over depending on the CA identity and related metrics.

In at least some embodiments it is contemplated that a relay processor may be programmed to coach a CA on various aspects of her relay workstation and how to handle calls generally and even specific calls while the calls are progressing. For instance, in at least some cases where a CA determines when to switch from an ASR-CA backed operating mode to a full CA mode, a system processor may track one or more metrics during the ASR-CA backed operating mode and compare those metrics to metrics for the CA in the CA profile to determine when a full CA mode would be better than the ASR-CA backed mode by at least some threshold value (e.g., 10% faster, 5% more accurate, etc.). Here, instead of automatically switching over to the full CA mode when that mode would likely be more accurate and/or faster by the threshold value, a processor may present a notice or warning to the CA encouraging the CA to make the switch to full CA mode along with statistics indicating the likely increase in captioning effectiveness (e.g., 10% faster, 5% more accurate). To this end, see the exemplary statistics shown at 1541 in FIG. 50 that are associated with a “Full CA Captioning/Correction” button.

In a similar fashion, when a CA operates a relay workstation in a full CA mode, the system may continually track metrics related to the CA's captions and compare those to ASR-CA backed up mode estimates for the specific CA (e.g., based on the CA's profile performance statistics) and may coach the CA on when to switch to the ASR-CA backed operating mode. In this regard, see for instance the speed and accuracy statistics shown at 753 in FIG. 47 that are associated with the ASR-CA Back Up button 751.

In at least some embodiments it is contemplated that a CA will be able to set various station operating parameters to preferred settings that the CA perceives to be optimal for the CA while captioning. For instance, in cases where a workstation operating mode can be switched between ASR-CA backed and full CA, a CA may be able to turn automatic switching on or turn that switching off so that a switch only occurs when the CA selects an on screen or other interface button to make the switch. As another instance, the CA may be able to specify whether or not metrics (e.g., speed and accuracy as at 753 in FIG. 47) are presented to the CA to encourage a manual mode switch. As another instance, a CA may be able to adjust a maximum cumulative captioning delay period that is enforced during calls. As still one other instance, a CA may be able to turn on and off a 2 times or 3 times broadcast rate feature that kicks in whenever a CA latency value exceeds some threshold duration. Many other station parameters are contemplated that may be set to different operating characteristics by a CA.

In at least some cases it is contemplated that a system processor tracking all or at least a subset of CA statistics for all or at least a subset of CAs may routinely compare CA statistical results to identify high and low performers and may then analyze CA workstation settings to identify any common setting combinations that are persistently associated with either high or low performers. Once persistent high performer settings are identified, in at least some cases a system processor may use those settings to coach other CAs and, more specifically, low performing CAs on best practices. In other cases, persistent high performer settings may be presented to a system administrator to show a correlation between those settings and performance and the administrator may then use those settings to develop best practice materials for training other CAs.

For example, assume that several CAs set workstation parameters such that a system processor only broadcasts HU voice signal corresponding to phrases that have confidence factors of 6/10 or less at the HU's speaking rate and speeds up broadcast of any HU voice signal corresponding to phrases that have 7/10 or greater confidence factors to 2× the HU's speaking rate. Also assume that these settings result in substantially faster CA error correction than other station settings. In this case, a notice may be automatically generated to lower performing CAs encouraging each to experiment with the expedited broadcast settings based on ASR text confidence factors.
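The confidence based broadcast setting in this example can be sketched as a simple rate selector. The function names and the representation of a phrase as a (duration in seconds, confidence out of 10) pair are assumptions made for illustration only.

```python
from typing import Iterable, Tuple


def broadcast_rate(confidence: int, slow_max: int = 6, fast_rate: float = 2.0) -> float:
    """Playback rate multiplier for a phrase with the given ASR confidence (0-10).

    Phrases at or below slow_max play at the HU's speaking rate (1.0);
    higher confidence phrases play at fast_rate.
    """
    return 1.0 if confidence <= slow_max else fast_rate


def total_broadcast_seconds(phrases: Iterable[Tuple[float, int]]) -> float:
    """Time needed to broadcast all phrases, given (duration_s, confidence) pairs."""
    return sum(duration_s / broadcast_rate(conf) for duration_s, conf in phrases)


if __name__ == "__main__":
    # A 4 second high confidence phrase plus a 3 second low confidence phrase.
    print(total_broadcast_seconds([(4.0, 9), (3.0, 5)]))  # 5.0
```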

Various system gaming aspects have been described above where CA statistics are presented to a CA to help her improve skills and captioning services in a fun way. In some cases it is contemplated that a system processor may routinely compare a specific CA with her own average and best statistics and present that information to the CA either routinely during calls or at the end of each call so that the CA can compete against her own prior statistics. In some cases two or more CAs may be pitted against each other, much like a race, to see who can caption the fastest, correct more errors in a short period of time, generate the most accurate overall caption text, etc. In some cases CAs may be able to challenge each other and may be presented real time captioning statistics during a challenge session where each gets to compare their statistics to the other CA's real time statistics. To this end, see the exemplary dual CA statistics shown at 771 in FIG. 47 where the statistics shown include average captioning delay, accuracy level and number of errors corrected for a CA using a station that includes the display screen 50 and another CA, Bill Blue, captioning and correcting at a different station. Leaders in each statistical category are visually distinguished. For instance, statistic values that are best in each category are shown double cross hatched in FIG. 47 to indicate green highlighting.

While CA call and performance metrics may be textually represented in some cases, in other cases particularly advantageous metric indicators may have at least some graphic characteristics so that metrics can be understood based on a simple glance. For instance, see the graphical performance representation at 787 in FIG. 47 where arrows 789 that represent instantaneous statistics dynamically float along horizontal accuracy and speed scales to indicate performance characteristics. In some cases the graphical characteristics may be calculated relative to personal averages from a specific CA's profile and in other cases the characteristics may be calculated relative to all or a subset of CAs associated with the system.

In some embodiments it is contemplated that CAs may be automatically rewarded for good performance or increases in performance over time. For instance, for each 2 hours a CA performs at or above some threshold performance level, she may be rewarded with a coupon for coffee or some other type of refreshment. As another instance, when a CA's persistent error correction performance level increases by 5% over time, she may be granted a paid hour off at the end of the week. As yet one other instance, where CAs compete head to head in a captioning and correcting contest, the winner of a contest may be granted some reward to incentivize performance increases over time.

In line error corrections are described above where initial ASR or CA generated text is presented to an AU immediately upon being generated and then when a CA or an ASR corrects an error in the initial text, the erroneous text is replaced “in line” in the text already presented to the AU. In at least some cases the corrected text is highlighted or otherwise visually distinguished so that an AU can clearly see when text has been corrected. Major and minor errors are also described where a minor error is one that, while wrong, does not change the meaning of an including phrase while a major error does change the meaning of an including phrase.

It has been recognized that when text on an AU display screen is changed and visually distinguished often, the cumulative highlighted changes can be distracting. For this reason, in at least some embodiments it is contemplated that a system processor may filter CA error corrections and may only change major errors on an AU display screen so that minor errors that have no effect on the meaning of including phrases are simply not shown to the AU. In many cases limiting AU text error correction to major error corrections can decrease in line on screen corrections by 70% or more, substantially reducing the level of distraction associated with the correction process.

To implement a system where only major errors are corrected on the AU display screen, all CA error corrections may be considered in context by a system processor (e.g., within including phrases) and the processor can determine if the correction changes the meaning of the including phrase. Where the correction affects the meaning of the including phrase, the correction is sent to the AU device along with instructions to implement an in line correction. Where the correction does not affect the meaning of the including phrase, the correction may simply be disregarded in some embodiments and therefore never sent to the AU device. In other cases where a correction does not affect the meaning of the including phrase, the correction may still be transmitted to the AU device and used to correct the error in a call text archive maintained by the AU device as opposed to in the on screen text. In this way, if the AU goes back in a call transcript to review content, all errors including major and minor are corrected.
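The routing of corrections described above can be sketched as follows. How a processor would actually decide whether a correction changes the meaning of the including phrase is not specified here, so the sketch takes that decision as a caller supplied predicate; `toy_changes_meaning` is a deliberately crude, hypothetical stand-in that only ignores a few filler words, and all names are assumptions for illustration.

```python
from typing import Callable, Dict

FILLER_WORDS = {"a", "an", "the", "um", "uh"}


def toy_changes_meaning(before: str, after: str) -> bool:
    """Hypothetical placeholder: phrases differ in meaning if they differ
    after filler words are dropped. A real system would need far more."""
    def content_words(phrase: str):
        return [w for w in phrase.lower().split() if w not in FILLER_WORDS]
    return content_words(before) != content_words(after)


def route_correction(before: str, after: str,
                     changes_meaning: Callable[[str, str], bool],
                     archive_minor: bool = True) -> Dict[str, bool]:
    """Decide where a CA correction is applied on the AU device.

    Major corrections go to the on screen text and the archive. Minor
    corrections go only to the archive, or nowhere if archive_minor is False.
    """
    major = changes_meaning(before, after)
    return {"on_screen": major, "archive": major or archive_minor}


if __name__ == "__main__":
    print(route_correction("meet at a restaurant", "meet at the restaurant",
                           toy_changes_meaning))
    print(route_correction("meet at the rest stop", "meet at the restaurant",
                           toy_changes_meaning))
```

The `archive_minor` flag covers both variants in the text: minor corrections that are disregarded entirely and minor corrections that are applied only to the stored call transcript.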

In other embodiments, instead of only correcting major errors on an AU device display screen, all errors may be corrected but the system may only highlight or otherwise visually distinguish major errors to reduce error correction distraction. Here, the thinking is that if an AU cares at all about error corrections, the most important corrections are the ones that change the meaning of an including phrase and therefore those changes should be visually highlighted in some fashion.

CA Sensors

CA station sensor devices can be provided at CA workstations to further enhance a CA's captioning and error correction capabilities. To this end, in at least some embodiments some type of eye trajectory sensor may be provided at a CA workstation for tracking the location on a CA display screen that a CA is looking at so that a word or phrase on the screen at the location instantaneously viewed by the CA can be associated with the CA's sight. To this end, see, for instance, the CA workstation 1700 shown in FIG. 54 that includes a display screen 50, keyboard 52 and headphones 54 as described above with respect to FIG. 1. In addition, the station 1700 includes an eye tracking sensor system that is represented by numeral 1702 that is directed at a CA's location at the station and specifically to capture images or video of the CA using the station. The camera field of view (FOV) is indicated at 1712 and is specifically trained on the face of a CA 1710 that currently occupies the station 1700.

Referring still to FIG. 54 and also to FIG. 55, images from sensor 1702 can be used to identify the CA's eyes and, more specifically, the trajectory of the CA's line of sight as labelled 1714. As best shown in FIG. 55, the CA's line of sight intersects the display screen 50 at a specific location where the text word “restaurant” is presented. In some embodiments, as illustrated, the word a CA is currently looking at on the screen 50 will be visually highlighted or otherwise distinguished as feedback to the CA indicating where the system senses that the CA is looking. Known eye tracking systems have been developed that generate invisible bursts of infrared light that reflects differently off a station user's eyes depending on where the user is looking. A camera picks up images of the reflected light which is then used to determine the CA's line of sight trajectory. In other cases a CA may wear a headset that tracks headset orientation in the ambient as well as the CA's pupil to determine the CA's line of sight. Other eye tracking systems are known in the art and any may be used in various embodiments.

Here, instead of having to move a mouse cursor to a word on the display screen or having to touch the word on the screen to select it, a CA may simply tap a selection button on her keyboard 52 once to select the highlighted word (e.g., the word subtended by the CA's line of sight) for error correction. In some cases a double tap of the keyboard selection button may cause the entire phrase or several words before and after the highlighted word to be selected for error correction.

Once a word or phrase is selected for error correction, the current HU voice signal broadcast 1720A may be halted, the word or phrase selected may be differently highlighted or visually distinguished and then re-broadcast for CA consideration as the CA uses the keyboard or microphone to edit the highlighted word or phrase. Once the word or phrase is corrected, the CA can tap an enter key or other keyboard button to enter the correction and cause the corrected text to be transmitted to the AU device for in line correction. Once the enter key is selected, HU voice signal broadcast would recommence at the word 1720 where it left off.

In some embodiments the eye tracking feature may be used to monitor CA activity and, specifically, whether or not the CA is considering all text generated by an ASR or CA re-voicing software. Here, additional metrics may include the percent of text words viewed by a CA for error correction, durations of time required to make error corrections, etc.

In at least some embodiments it is contemplated that two or more ASR engines of different types (e.g., developed and operated by different entities) may be available for HU voice signal captioning. In these cases, it is contemplated that one of the ASR engines may generate substantially better captioning results than other engines. In some cases it is contemplated that at the beginning of an AU-HU call, the HU voice may be presented to two or more ASR engines so that two or more HU voice signal text transcripts are generated. Here, a CA may correct one of the ASR text transcripts to generate a “truth” transcript presented to an AU. Here, the truth transcript may be automatically compared by a processor to each of the ASR text transcripts associated with the call to rank the ASR engines best to worst for transcribing the specific call. Then, the system may automatically start using the best ASR engine for transcription during the call and may scrap use of the other engines for the remainder of the call. In other cases while the other engines may be disabled, they may be re-enabled if captioning metrics deteriorate below some threshold level and the process above of assigning metrics to each engine as text transcripts are generated may be repeated to identify a current best ASR engine to continue servicing the call.
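One way the comparison of each ASR transcript to the truth transcript might be carried out is sketched below. The description does not name a comparison metric; word error rate (word-level edit distance divided by the number of truth words) is assumed here purely for illustration, and the function names and engine labels are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def rank_engines(truth, transcripts):
    """Rank engines best to worst by word error rate against the truth text.

    transcripts maps a (hypothetical) engine name to that engine's transcript.
    Returns a list of (engine_name, error_rate) pairs, lowest error first.
    """
    ref = truth.lower().split()
    scores = {
        name: word_errors(ref, text.lower().split()) / max(len(ref), 1)
        for name, text in transcripts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_engines(
    "join us at Pete's restaurant",
    {
        "engine_a": "john us at Pal's rest ant",
        "engine_b": "join us at Pete's restaurant",
        "engine_c": "join us at Pal's restaurant",
    },
)
best_engine = ranking[0][0]
```

Re-running `rank_engines` on later portions of the call would correspond to the repeated evaluation described above for the case where captioning metrics deteriorate below a threshold.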

In at least some cases one or more biometric sensors may be included within an AU's caption device that can be used for various purposes. For instance, see again FIG. 1 where a camera 75 is included in device 12 for obtaining images of an AU using the caption device 12 during a voice communication with an HU. Other biometric sensor devices are contemplated such as, for instance, the microphone in handset 22, a finger print reader 23 on device 12 or handset 22, etc., each of which may be used to confirm AU user identity.

One purpose for camera 75 or another biometric sensor device may be to recognize a specific AU and only allow the captioning service to be used by a certified hearing impaired AU. Thus, for instance, a software application run by a processor in device 12 or that is run by the system server 30 may perform a face recognition process each time device 12 is activated, each time any person is located within the field of view (FOV) of camera 75, each time the camera senses movement within its FOV, etc. In this case it is contemplated that any AU that is hearing impaired would have to pre-register with the system where the system is initially enabled by scanning the AU's face to generate a face recognition model which would be stored for subsequent device enablement processes. In other cases it is contemplated that hearing specialists or physicians may, upon diagnosing an AU with sufficient hearing deficiency to warrant the captioning service, obtain an image of the AU's face or an entire 3D facial model using a smart phone or the like which is uploaded to a system server 30 and stored with user identification information to facilitate subsequent facial recognition processes as contemplated here. In this way, AUs that are not comfortable with computers or technology may be spared the burden of commissioning their caption devices at home which, for some, may not be intuitive.

After a caption device is set up and commissioned, once an authorized AU is detected in the camera FOV, device 12 may operate in any of the ways described above to facilitate captioned or non-captioned calls for an AU. Where a person not authorized to use the caption service uses device 12 to make a call, device 12 may simply not provide any caption related features per the graphical display screen so that device 12 operates like a normal display based phone device.

In other cases images or video from camera 75 may be provided to an HU or even a CA to give either or both of those people a visual representation of the AU so that each can get a sense from non-verbal cues of effectiveness of AU communications. When a visual representation of the AU is presented to either or both of the HU and CA, some clear indicator of the visual representation will be given to the AU such as, for instance, a warning message on display 18 of device 12. In fact, prior to presenting AU images or video to others, device 12 may seek AU authorization in a clear fashion so that the AU is not caught off guard.

In at least some embodiments described above, ASR or other currently best caption text (e.g., CA generated text in a full CA mode of operation) is presented immediately or at least substantially immediately to an AU upon generation and subsequently, when an error in that initial text is corrected, the error is corrected within the text presented to the AU by replacing the initial erroneous text with corrected text. To notify the AU that the text has been modified, the corrected text is highlighted or otherwise visually distinguished in line. It has been recognized that while highlighting or other tagging to distinguish corrected text is useful in most cases, those highlights or tags can become distracting under certain circumstances. For instance, when substantial or frequent error corrections are made, the new text highlighting can be distracting to an AU participating in a call.

In some cases, as described above, a system processor may be programmed to determine if error corrections result in a change in meaning in an including sentence and may only highlight error corrections that are meaningful (e.g., change the meaning of the including sentence). Here, all error corrections would be made on the AU device display but only meaningful error corrections would be highlighted.

In other cases it is contemplated that all error corrections may be visually distinguished where meaningful corrections are distinguished in one fashion and minor (e.g., not changing meaning of including sentence) error corrections are distinguished in a relatively less noticeable fashion. For instance, minor error corrections may be indicated via italicizing text swapped into original text while meaningful corrections are indicated via yellow or green or some other type of highlighting.

In still other cases all error corrections may be distinguished initially upon being made but the highlighting or other distinguishing effect may be modified based on some factor such as time, number of words captioned since the error was corrected, number of error corrections since the error was corrected, or some combination of these factors. For example, an error correction may initially be highlighted bright yellow and, over the next 8 seconds, the highlight may be dimmed until it is no longer visually identifiable. As another example, a first error correction may be highlighted bright yellow and that highlighting may persist until each of a second and third error correction that follows the first correction is made after which the first error correction highlighting may be completely turned off. As yet one other instance, an error correction may be initially highlighted bright yellow and bolded and, after 8 subsequent text words are generated, the highlighting may be turned off while the bold effect continues. Then, after a next two error corrections are made, the bold effect on the first error correction may be eliminated. Many other expiring error correction distinguishing effects are contemplated.
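Two of the expiring effects described above can be sketched as follows. Both functions are illustrative only: the linear dimming curve is an assumption (the description says only that the highlight is dimmed over 8 seconds), the function and parameter names are hypothetical, and the second function treats the word count and the correction count as independent conditions, which is one possible reading of the third example.

```python
def highlight_opacity(seconds_since_correction, fade_seconds=8.0):
    """Opacity of a correction highlight, from 1.0 (bright) down to 0.0 (gone).

    Assumes a linear fade over fade_seconds; other dimming curves would work.
    """
    if seconds_since_correction <= 0:
        return 1.0
    return max(0.0, 1.0 - seconds_since_correction / fade_seconds)


def active_effects(words_since, corrections_since,
                   highlight_words=8, bold_corrections=2):
    """Effects still applied to a correction under the word/correction rule.

    words_since:       caption words generated since this correction was made
    corrections_since: later error corrections made since this correction
    """
    effects = set()
    if words_since < highlight_words:
        effects.add("highlight")
    if corrections_since < bold_corrections:
        effects.add("bold")
    return effects


fresh = active_effects(words_since=0, corrections_since=0)   # highlight and bold
later = active_effects(words_since=8, corrections_since=0)   # bold only
oldest = active_effects(words_since=8, corrections_since=2)  # no effect
```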

Referring now to FIG. 56, a screen shot of an AU interface is shown that may be presented on a caption device display 18 that shows caption text that includes some errors where a first error is shown corrected at 2102 (e.g., the term “Pal's” has been corrected and replaced with “Pete's”). As illustrated the new term “Pete's” is visually distinguished in two ways including highlighting and changing the font to be bold and italic.

Referring also to FIG. 57, a screen shot similar to the FIG. 56 shot is shown, albeit where a second error (e.g., “John”) has been corrected and replaced in line with the term “join” 1204. In this example, the correction distinguishing rules are that a most recent error correction is highlighted, bold and italic, a second most recent error correction is indicated only via bold and italic font (e.g., no highlighting) and that when two error corrections occur after any error correction, the earliest of those corrections is no longer highlighted (e.g., is shown as regular text). Thus, in FIG. 57, the error correction at 1202 is now distinguished by bold and italic font but is no longer highlighted and the most recent error correction at 1204 is highlighted and shown via bold and italic font.

Referring to FIG. 58, a screen shot similar to the FIG. 56 and FIG. 57 shots is shown, albeit where a third error (e.g., “rest ant”) has been corrected and replaced in line with the term “restaurant” 2106. Consistent with the correction distinguishing rules described above, the most recent correction 1206 is shown highlighted, bolded and italic, the prior error correction at 1204 is shown bolded and italic and the error correction at 1202 is shown as normal text with no special effect.
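The correction distinguishing rules walked through for FIG. 56 through FIG. 58 can be expressed compactly. The sketch below is hypothetical (the function name and the representation of effects as sets of strings are assumptions), but the rule it applies is the one stated above: the most recent correction is highlighted, bold and italic, the second most recent is bold and italic only, and any earlier correction is shown as normal text.

```python
def correction_styles(corrections):
    """Map each corrected term to its display effects.

    corrections: corrected terms ordered from oldest to most recent.
    """
    styles = {}
    for age, term in enumerate(reversed(corrections)):
        if age == 0:
            styles[term] = {"highlight", "bold", "italic"}  # most recent
        elif age == 1:
            styles[term] = {"bold", "italic"}               # second most recent
        else:
            styles[term] = set()                            # normal text
    return styles


# State of the display after each correction in the three figures.
fig_56 = correction_styles(["Pete's"])
fig_57 = correction_styles(["Pete's", "join"])
fig_58 = correction_styles(["Pete's", "join", "restaurant"])
```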

In any case where a second CA is taking over primary captioning from either an ASR or a first or initial CA at a specific point in an HU voice signal, the system may automatically broadcast at least a portion of the HU voice signal that precedes the point at which the second CA is taking over captioning to the second CA to provide context for the second CA. For instance, the system may automatically broadcast 7 seconds of HU voice signal that precede the point where the second CA takes over captioning so that when the CA takes over, the CA has context in which to start captioning the first few words of the HU voice signal to be captioned by the CA. In at least some cases the system may audibly distinguish HU voice signal provided for context from HU voice signal to be captioned by the CA so that the CA has a sense of what signal to caption and which is simply presented as context. For instance, the tone or pitch or rate of broadcast or volume of the contextual HU voice signal portion may be modified to distinguish that portion of the voice signal from the signal to be captioned.
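The context broadcast described above might be planned as in the following sketch. It assumes the HU voice signal is buffered with time offsets in seconds and that the contextual portion is distinguished by reduced volume; the `Segment` type, the `handoff_playback` function and the 0.5 gain value are hypothetical choices, and tone, pitch or rate could be varied instead.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float   # offset into the buffered HU voice signal, in seconds
    end_s: float
    role: str        # "context" (not to be captioned) or "caption"
    gain: float      # playback volume multiplier


def handoff_playback(takeover_s, buffered_until_s,
                     context_s=7.0, context_gain=0.5):
    """Plan what the second CA hears when taking over at takeover_s."""
    if buffered_until_s < takeover_s:
        raise ValueError("takeover point is beyond the buffered signal")
    context_start = max(0.0, takeover_s - context_s)
    segments = []
    if context_start < takeover_s:
        # Preceding audio, played more quietly so it is heard as context only.
        segments.append(Segment(context_start, takeover_s,
                                "context", context_gain))
    # Audio from the takeover point onward is what the second CA captions.
    segments.append(Segment(takeover_s, buffered_until_s, "caption", 1.0))
    return segments


plan = handoff_playback(takeover_s=30.0, buffered_until_s=42.0)
```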

Systems have been described above where ongoing calls are automatically transferred from a first CA to a second CA based on CA expertise in handling calls with specific detected characteristics. For instance, a call where an HU has a specific accent may be transferred mid-call to a CA that specializes in the detected accent, a call where a line is particularly noisy may be transferred to a CA that has scored well in terms of captioning accuracy and speed for low audio quality calls, etc. One other call characteristic that may be detected and used to direct calls to specific CAs is call subject matter related to specific technical or business fields where specific CAs having expertise in those fields will typically have better captioning results. In these cases, in at least some embodiments, a system processor may be programmed to detect specific words or phrases that are telltale signs that call subject matter is related to a specific field or discipline handled best by specific CAs and, once that correlation is determined, an associated call may be transferred from an initial CA to a second CA that specializes in captioning that specific subject matter.

In some cases an AU may work in a specific field in which the AU and many HUs that the AU converses with use complex field specific terminology. Here, a system processor may be programmed to learn over time that the AU is associated with the specific field based on conversation content (e.g., content of the HU voice signal and, in some cases, content of an AU voice signal) and, in addition to generating an utterance and text word dictionary for an AU, may automatically associate specific CAs that specialize in the field with any call involving the AU's caption device (as identified by the AU's phone number or caption device address). For instance, if an AU is a neuroscientist and routinely participates in calls with industry colleagues using complex industry terms, a system processor may recognize the terms and associate the terms and AU with an associated industry. Here, specific CAs may be associated with the neuroscience industry and the system may associate those CAs with the calling number of the AU so that going forward, all calls involving the AU are assigned to CAs specializing in the associated industry whenever one of those CAs is available. If a specialized CA is not available at the beginning of a call involving the AU, the system may initiate captioning using a first CA and then once a specialized CA becomes available, may transfer the call to the available CA to increase captioning accuracy, speed or both.

In some cases it is contemplated that an AU may specify a specific field or fields that the AU works in so that the system can associate the AU with specific CAs that specialize in captioning for that field or those fields. For instance, in the above example, a neuroscientist AU may specify neuroscience as her field during a caption device commissioning process and the system may then associate ten different CAs that specialize in calls involving terminology in the field of neuroscience with the AU's caption device. Thereafter, when the AU participates in a call and requires CA captioning, the call may be linked to one of the associated specialized CAs when one is available.

In some embodiments it is contemplated that a system may track AU interaction with her caption device and may generate CA preference data based on that interaction that can be used to select or avoid specific CAs in the future. For instance, where an AU routinely indicates that the captioning procedure handled by a specific CA should be modified, once a trend associated with the specific CA for the specific AU is identified, the system may automatically associate the CA with a list of CAs that should not be assigned to handle calls for the AU.

In some cases it is contemplated that the system may enable an AU to indicate perceived captioning quality at the end of each call or at the end of specific calls based on caption confidence factors or some other metric(s) so that the AU can directly indicate a non-preference for CAs. Similarly, an AU may be able to indicate a preference for a specific CA or that a particular caption session was exceptionally good in which case the CA may be added to a list of preferred CAs for the AU. In these cases, calls with the AU would be assigned to preferred CAs and not assigned to CAs on the non-preferred list. Here, at the end of each of a subset of calls, an AU may be presented with touch selectable icons (e.g., “Good Captioning”; “Unsatisfactory Captioning”) enabling the AU to indicate satisfaction level for captioning service related to the call.
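A minimal sketch of how specialist associations and the preferred and non-preferred lists described above might be combined when assigning a CA to a call follows. The function name, the CA identifiers and the ordering of the criteria (preferred status ahead of field specialization) are assumptions made for illustration; the description does not specify how the two are weighed against each other.

```python
def choose_ca(available, specialists=frozenset(),
              preferred=frozenset(), non_preferred=frozenset()):
    """Pick a CA for an AU's call, or None if no acceptable CA is available.

    available:     identifiers of currently available CAs, in queue order
    specialists:   CAs associated with the AU's field
    preferred:     CAs the AU has rated well
    non_preferred: CAs that should not be assigned to this AU
    """
    candidates = [ca for ca in available if ca not in non_preferred]
    if not candidates:
        return None
    # False sorts before True, so preferred CAs come first, then specialists;
    # min() keeps the earliest queued CA among equally ranked candidates.
    return min(candidates,
               key=lambda ca: (ca not in preferred, ca not in specialists))


assigned = choose_ca(
    available=["ca_17", "ca_04", "ca_22"],
    specialists={"ca_22"},
    non_preferred={"ca_17"},
)
```

When `choose_ca` returns a CA that is not a specialist, the call could be started with that CA and transferred later once a specialist becomes available, as described above.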

While embodiments are described above where specific CAs are associated with preferred and non-preferred lists or optimal and non-optimal lists for specific AUs, it should be appreciated that similar preferences or optimality ratings may be ascribed to different captioning processes. For instance, a first AU may routinely rank ASR captioning poorly but full CA captioning highly and, in that case, the system may automatically configure so that all calls for the first AU are handled via full CA captioning. For a second AU, the system may automatically generate caption confidence factors and use those factors to determine that the mix of captioning speed and accuracy is almost always best when initial captions are generated via an ASR system and one of 25 CAs that are optimal for the second AU is assigned to perform error corrections on the initial caption text.

To apprise the public of the scope of the present invention the following claims are made.

1. A relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device including an AU device display screen, where the HU voice signal is transmitted from the HU device to the AU device, the relay completely separate from the AU device and comprising: a relay display screen; a processor linked to the relay display screen and programmed to perform the steps of: receiving the HU voice signal from the AU device, wherein the AU device received the HU voice signal from the HU device; transmitting the HU voice signal to a remote automatic speech recognition (ASR) server running ASR software that converts the HU voice signal to ASR generated text, the remote ASR server located at a remote location from the relay; receiving the ASR generated text from the ASR server; present the ASR generated text for viewing by a call assistant (CA) via the relay display; and transmitting the ASR generated text to the AU device immediately upon receiving the ASR generated text from the ASR.
2. The relay of claim 1 further including an interface that enables a CA to make changes to the ASR generated text presented on the relay display.
3. The relay of claim 2 wherein the processor is further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device.
4. The relay of claim 1 wherein the relay separates the HU voice signal into voice signal slices, the step of transmitting the HU voice signal to the ASR server includes independently transmitting the voice signal slices to the remote ASR server for captioning and wherein the step of receiving the ASR generated text from the relay includes receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text.
5. The relay of claim 4 wherein at least some of the voice signal slices overlap.
6. The relay of claim 4 wherein at least some of the voice signal slices are relatively short and some of the voice signal slices are relatively long and wherein the short voice signal slices are consecutive and do not overlap and wherein at least some relatively long voice signal slices overlap at least first and second of the relatively short voice signal slices.
7. The relay of claim 5 wherein at least some of the ASR generated text associated with overlapping voice signal slices is inconsistent, the relay applying a rule set to identify which inconsistent ASR generated text to use in the stream of ASR generated text.
8. The relay of claim 1 wherein the ASR server generates ASR error corrections for the ASR generated text, the relay further programmed to perform the steps of receiving ASR error corrections, using the error corrections to automatically correct at least some of the errors in the ASR generated text on the relay display screen and transmitting the ASR error corrections to the AU device.
9. The relay of claim 8 further including an interface that enables a CA to make changes to the ASR generated text presented on the relay display screen, the processor further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device.
10. The relay of claim 9 wherein, after a CA makes a change to ASR generated text, the text prior thereto becomes firm so that no ASR error corrections are made to the text subsequent thereto.
11. The relay of claim 1 wherein the relay further includes a speaker and wherein the processor broadcasts the HU voice signal to the CA via the speaker as the ASR generated text is presented on the relay display screen.
12. The relay of claim 11 wherein the processor aligns broadcast of the HU voice signal with ASR generated text presented on the display screen.
13. The relay of claim 11 wherein the processor presents the ASR generated text on the display screen immediately upon reception and transmits the ASR generated text immediately upon reception and broadcasts the HU voice signal under control of the CA using an interface.
14. The relay of claim 13 wherein, as a word in the HU voice signal is broadcast to the CA, text corresponding to the broadcast word on the display screen is visually distinguished from other text on the display screen.
15. A relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device including an AU device display screen where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising: a relay display screen; an interface device; a processor linked to the relay display screen and the interface device, the processor programmed to perform the steps of: receiving the HU voice signal from the AU device, wherein the AU device received the HU voice signal from the HU device; separating the HU voice signal into voice signal slices; separately transmitting the HU voice signal slices to a remote automatic speech recognition (ASR) server that is located at a remote location from the relay; receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text; present the stream of ASR generated text as it is received from the ASR server for viewing by a call assistant (CA) via the display; and transmitting the stream of ASR generated text to the AU device as the stream is received from the ASR server.
16. The relay of claim 15 wherein ASR error corrections to the ASR generated text are received from the ASR server and at least some of the ASR error corrections are used to correct the text on the display, the relay receives CA error corrections to the text on the display and uses those corrections to correct text on the display.
17. The relay of claim 16 wherein, once a CA corrects an error in the text on the display, ASR error corrections for text prior to the CA corrected text on the display are not used to make error corrections on the display.
18. The relay of claim 17 wherein all ASR generated text presented on the display is transmitted to the AU device and all ASR error corrections and CA text corrections that are presented on the display are transmitted as correction text to the AU device.
19. A caption device for use by a hard of hearing assisted user (AU) to assist the AU during voice communications with a hearing user (HU) using an HU device, the caption device comprising: a display screen; a memory; at least one communication link element for linking to a communication network; a speaker; a processor linked to each of the display screen, the memory, the speaker and the communication link, the processor programmed to perform the steps of: receiving an HU voice signal from the HU device during a call; broadcasting the HU voice signal to the AU via the speaker; storing at least a most recent portion of the HU voice signal in the memory prior to receiving a command from the AU to start a captioning session; receiving a command from the AU to start a captioning session; upon receiving the command, obtaining a text caption corresponding to the stored HU voice signal; and presenting the text caption to the AU via the display.
20. The device of claim 19 wherein the step of obtaining a text caption includes initiating a process whereby an automated speech recognition (ASR) program converts the stored HU voice signal to text.
21. The device of claim 20 wherein the processor runs the ASR program.
22. The device of claim 21 wherein the step of initiating the process includes establishing a link to a remote relay, and transmitting the stored HU voice signal to the relay, the step of obtaining further including receiving the text caption from the relay.
23. The device of claim 19 further including, subsequent to receiving the command, obtaining text captions for additional HU voice signals received during the ongoing call.
24. The device of claim 23 wherein the step of obtaining text caption of the stored HU voice signal includes initiating a process whereby the HU voice signal is converted to text via an automatic speech recognition (ASR) engine and wherein the step of obtaining text captions for additional HU voice signal received during the ongoing call further includes transmitting the additional HU voice signal to a relay and receiving text captions back from the relay.